Fixes #2714: contractions fail with SEGV when computing non-linear scalars such as autodiff

Reference issue

#2714

What does this implement/fix?

This MR fixes a bug in Tensor contractions when using non-linear scalars, or scalars that need initialization/finalization. The issue affects contractions running on DefaultDevice and ThreadPoolDevice, causing segmentation faults and memory leaks.

It turns out that the current contraction code uses a raw memory chunk to allocate the operands, correctly deallocating this block after the computation. However, with exotic scalars such as Eigen::AutoDiffScalar, each element is ultimately a graph with pointers to other memory locations, which must be initialized before the contraction and finalized after it.

The raw memory approach matters for performance: it provides aligned memory that enables vectorization.

This MR fixes this issue by performing two additional steps:

  • initializing the scalars by explicitly calling their constructors
  • finalizing the scalars by explicitly invoking the destructor

Note that these two steps must not allocate/deallocate the memory themselves. The device has already allocated the memory, so we cannot allocate it again; likewise, we cannot deallocate it ourselves, since the device will deallocate it later using the correct alignment offset.

As a solution, the fix uses placement new and explicit destructor calls. These two actions are implemented in the TensorContractionKernel struct:

template <typename ResScalar, typename LhsScalar, typename RhsScalar,
    typename StorageIndex, typename OutputMapper, typename LhsMapper,
    typename RhsMapper>
struct TensorContractionKernel {
  // ...

  EIGEN_DEVICE_FUNC void initialize_block(BlockMemHandle block) {
    //...
  }

  template <typename Device>
  EIGEN_DEVICE_FUNC BlockMemHandle allocate(Device& d, LhsBlock* lhs_block, RhsBlock* rhs_block) {
    BlockMemHandle result = BlockMemAllocator::allocate(d, bm, bk, bn, lhs_block, rhs_block);
    initialize_block(result);
    return result;
  }

  template <typename Device>
  EIGEN_DEVICE_FUNC BlockMemHandle allocateSlices(
      Device& d, const StorageIndex num_lhs, const StorageIndex num_rhs,
      const StorageIndex num_slices, std::vector<LhsBlock>* lhs_blocks,
      std::vector<RhsBlock>* rhs_blocks) {

    BlockMemHandle result = BlockMemAllocator::allocateSlices(
        d, bm, bk, bn, num_lhs, num_rhs, num_slices, lhs_blocks, rhs_blocks);    
    initialize_block(result);
    return result;
  }

  EIGEN_DEVICE_FUNC void finalize_block(BlockMemHandle block) {
    // ...
  }

  template <typename Device>
  EIGEN_DEVICE_FUNC void deallocate(Device& d, BlockMemHandle handle) {
    finalize_block(handle);
    BlockMemAllocator::deallocate(d, handle);
  }
};
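
For reference, the bodies of initialize_block and finalize_block boil down to loops like the following. This is a minimal sketch rather than the exact MR code; it assumes the element counts are derivable from bm, bk, bn, and the whole pass is skipped for trivial scalars:

EIGEN_DEVICE_FUNC void initialize_block(BlockMemHandle block) {
  if (!NumTraits<LhsScalar>::RequireInitialization) return;
  LhsScalar* lhs = static_cast<LhsScalar*>(block);
  for (StorageIndex i = 0; i < bm * bk; ++i) {
    new (lhs + i) LhsScalar();  // placement new: construct in already-allocated memory
  }
  // ... likewise for the RhsScalar region at its offset within the block
}

EIGEN_DEVICE_FUNC void finalize_block(BlockMemHandle block) {
  if (!NumTraits<LhsScalar>::RequireInitialization) return;
  LhsScalar* lhs = static_cast<LhsScalar*>(block);
  for (StorageIndex i = 0; i < bm * bk; ++i) {
    lhs[i].~LhsScalar();  // explicit destructor call: the memory itself stays with the device
  }
  // ... likewise for the RhsScalar region
}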

Tests

This MR includes two new tests: test_scalar_initialization (in cxx11_tensor_contraction.cpp) and test_multithread_contraction_with_scalar_initialization (in cxx11_tensor_thread_pool.cpp). Both tests use a simple custom scalar type, InitializableScalar, sketched below.
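
The idea behind such a scalar is that every element owns heap memory, so a skipped constructor or destructor shows up immediately as a crash or a leak. A minimal sketch of this kind of type (illustrative; the actual definition lives in the test files):

#include <Eigen/Core>

struct InitializableScalar {
  double* value;  // heap-owned payload: uninitialized use crashes, missing destruction leaks
  InitializableScalar() : value(new double(0.0)) {}
  InitializableScalar(double v) : value(new double(v)) {}
  InitializableScalar(const InitializableScalar& other) : value(new double(*other.value)) {}
  InitializableScalar& operator=(const InitializableScalar& other) {
    *value = *other.value;
    return *this;
  }
  ~InitializableScalar() { delete value; }
  InitializableScalar operator+(const InitializableScalar& o) const { return {*value + *o.value}; }
  InitializableScalar operator*(const InitializableScalar& o) const { return {*value * *o.value}; }
  InitializableScalar& operator+=(const InitializableScalar& o) { *value += *o.value; return *this; }
};

namespace Eigen {
template <>
struct NumTraits<InitializableScalar> : NumTraits<double> {
  typedef InitializableScalar Real;
  typedef InitializableScalar NonInteger;
  typedef InitializableScalar Nested;
  enum { RequireInitialization = 1 };  // the flag the fix keys on
};
}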

Additional information

  1. Contractions of trivial scalars such as float or double are not affected by the initialization/finalization steps, because the fix first checks NumTraits<LhsScalar>::RequireInitialization.

  2. Both before and after this MR, contractions of exotic scalars do not compile on GPU devices.

Edited by Luiz doleron

Merge request reports


Activity

The diff hunk under discussion (the change to allocate in TensorContractionKernel; removed lines marked -, added lines marked +):

  template <typename Device>
  EIGEN_DEVICE_FUNC BlockMemHandle allocate(Device& d, LhsBlock* lhs_block,
                                            RhsBlock* rhs_block) {
-   return BlockMemAllocator::allocate(d, bm, bk, bn, lhs_block, rhs_block);
+   BlockMemHandle result = BlockMemAllocator::allocate(d, bm, bk, bn, lhs_block, rhs_block);
+   initialize_block(result);
  • The resulting memory handle may not be writable from the current host.

    For example, on GPU this allocate function performs a cudaMalloc. What you really need is for the Device itself to do the initialization, which on GPU would require a special kernel.

  • Hi @cantonios, thank you for the feedback. This MR was intended to fix contractions on the default and ThreadPool devices. In my tests, contractions of AutoDiffScalar on GPU devices don't compile, so fixing those contractions probably requires more substantial work.

    For the case of TP/default devices, the suggestion here was to run the initialization/finalization in TensorContractionKernel because this looks like a good balance between type specialization and cohesion.

    It turns out that making the device initialize/finalize by itself requires the device to know the operand layout. One way to achieve this without tightly coupling the device type to the operation type is to pass the operand layout as an optional parameter to device.allocate() or device.deallocate().

    If I'm not missing something, my suggestion would be to initialize/finalize the scalars in TensorContractionKernel (or perhaps in TensorContractionBlockMemAllocator?). Once contractions work on TP/default devices, we can move on to tackling the GPU issues. What do you think?

  • We can't just hack it so that CPU contractions work. The system still needs to be designed consistently. Otherwise we'll need to redesign the whole thing in order to get the GPU (or other custom devices like sycl) to work.

    The purpose of having a Device is to handle these device-specific differences. The device itself shouldn't need to know the layout - only the number of elements and type. We probably need something like

    template<typename T>
    T* allocate(size_t count);
    
    template<typename T>
    void deallocate(T* ptr);

    that on CPU (including threadpool) will just call new/delete if the type requires initialization. Something similar to what we already do in Memory.h. You would then need to modify BlockMemAllocator to use this - that's the thing that controls the layout.
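
    For illustration only, a sketch of that idea on the CPU side (assuming Eigen's internal::aligned_malloc/aligned_free as the backend; note that the element count is also needed at destruction time, which the proposed deallocate signature would have to carry):

    struct DefaultDevice {
      // ...
      // Typed allocation in the spirit of internal::conditional_aligned_new in
      // Memory.h: construct elements only when the type requires it.
      template <typename T>
      T* allocate(size_t count) const {
        T* ptr = static_cast<T*>(internal::aligned_malloc(count * sizeof(T)));
        if (NumTraits<T>::RequireInitialization) {
          for (size_t i = 0; i < count; ++i) new (ptr + i) T();  // placement new
        }
        return ptr;
      }

      template <typename T>
      void deallocate(T* ptr, size_t count) const {
        if (NumTraits<T>::RequireInitialization) {
          for (size_t i = 0; i < count; ++i) ptr[i].~T();  // explicit destructor
        }
        internal::aligned_free(ptr);
      }
    };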

  • Ok, I got it. Totally agree. I will proceed with the changes.

  • Luiz doleron changed this line in version 8 of the diff

  • Hi, as asked, I implemented the initialization of scalars on the device:

    struct DefaultDevice {
    // ...
      template<typename T>
      void* allocate_elements(size_t num_elements) const {
        const size_t element_size = sizeof(T);
        size_t num_bytes = num_elements * element_size;        // total payload size
        size_t align = numext::maxi(EIGEN_MAX_ALIGN_BYTES, 1);
        num_bytes = divup<Index>(num_bytes, align) * align;    // round up to the alignment
        void* result = allocate(num_bytes);                    // raw aligned allocation
        if (NumTraits<T>::RequireInitialization) {             // construct only when necessary
          char* mem_pos = reinterpret_cast<char*>(result);
          for (size_t i = 0; i < num_elements; ++i) {
            new (mem_pos) T();                                 // placement new per element
            mem_pos += element_size;
          }
        }
        return result;
      }
    //...
    };

    This code, together with the corresponding allocator below, passes the tests:

    struct TensorContractionBlockMemAllocator {
      
      template <typename Device>
      static BlockMemHandle allocate(...) {
    
        BlockSizes sz = ComputeLhsRhsBlockSizes(bm, bk, bn); // calculates alignment, block and memory sizes
        char* block_mem = static_cast<char*>(d.template allocate_elements<LhsScalar>(sz.lhs_count + sz.rhs_count)); // allocating memory using only LhsScalar as type
        *lhs_block = static_cast<LhsScalar*>(static_cast<void*>(block_mem));
        *rhs_block = static_cast<RhsScalar*>(static_cast<void*>(block_mem + sz.lhs_size)); // sz.lhs_size is calculated by the host
        
        return block_mem;
      }
    //...
    };

    Note that:

    • Since the whole block is initialized as a single type, the memory is initialized using only LhsScalar.
    • On the host side, ComputeLhsRhsBlockSizes computes the memory size and alignment, which in my understanding are device-dependent concerns.

    Because of these two drawbacks, I implemented a second approach:

    struct TensorContractionBlockMemAllocator {
      template <typename Device>
      EIGEN_DEVICE_FUNC static BlockMemHandle allocate(Device& d, const Index bm,
                                                       const Index bk,
                                                       const Index bn,
                                                       LhsScalar** lhs_block,
                                                       RhsScalar** rhs_block) {
        
        std::vector<void*> blocks;
        Index left_block_count = bm * bk;
        Index right_block_count = bn * bk;
        char* block_mem = static_cast<char*>(d.template allocate_blocks<LhsScalar, RhsScalar>(left_block_count, right_block_count, blocks));
        *lhs_block = static_cast<LhsScalar*>(blocks[0]);
        *rhs_block = static_cast<RhsScalar*>(blocks[1]);
    
        return block_mem;
      }
    };

    This second version addresses both drawbacks:

    • It uses both LhsScalar & RhsScalar to allocate/initialize the data.
    • The device does all the calculations of memory size and alignment.

    The ThreadPool use case requires a slightly more involved call: in the multithreaded contraction, a single memory block is sharded into several slices so that multiple threads can do their work without blocking each other.

    The following call covers this scenario:

    template <typename LhsScalar, typename RhsScalar>
    struct TensorContractionBlockMemAllocator {
      //...
    
      template <typename Device>
      static BlockMemHandle allocateSlices(
          Device& d, const Index bm, const Index bk, const Index bn,
          const Index num_lhs, const Index num_rhs, const Index num_slices,
          std::vector<LhsScalar*>* lhs_blocks,
          std::vector<RhsScalar*>* rhs_blocks) {
        
        std::vector<void*> blocks;
        char* block_mem = static_cast<char*>(d.template allocate_blocks<LhsScalar, RhsScalar>(bm * bk, bn * bk, blocks, num_lhs, num_rhs, num_slices));
    
        Index blocks_index = 0;
        for (Index slice = 0; slice < num_slices; slice++) {
          if (num_lhs > 0) lhs_blocks[slice].resize(num_lhs);
          for (Index m = 0; m < num_lhs; m++) {
            void* block_memory = blocks[blocks_index++];
            lhs_blocks[slice][m] = static_cast<LhsScalar*>(block_memory);
          }
          if (num_rhs > 0) rhs_blocks[slice].resize(num_rhs);
          for (Index n = 0; n < num_rhs; n++) {
            void* block_memory = blocks[blocks_index++];
            rhs_blocks[slice][n] = static_cast<RhsScalar*>(block_memory);
          }
        }
    
        return block_mem;
      }
    };

    Note again that the memory positions in blocks are computed by the device.

    The deallocation side uses the same ideas:

    ~EvalParallelContext() {
      kernel_.deallocate(device_, packed_mem_, nm0_, nn0_, std::min<Index>(nk_, P - 1));
      if (parallelize_by_sharding_dim_only_) {
        const int num_worker_threads = device_.numThreadsInPool();
        if (shard_by_col_) {
          Index num_blocks = num_worker_threads * gn_;
          kernel_.deallocate(device_, thread_local_pre_alocated_mem_, 0, num_blocks, 1);
        } else {
          Index num_blocks = num_worker_threads * gm_;
          kernel_.deallocate(device_, thread_local_pre_alocated_mem_, num_blocks, 0, 1);
        }
      }
    }

    and

    template <typename LhsScalar, typename RhsScalar>
    struct TensorContractionBlockMemAllocator {
      //...
    
      template <typename Device>
      EIGEN_DEVICE_FUNC static void deallocate(Device& d, BlockMemHandle handle, const Index bm,
                                                       const Index bk,
                                                       const Index bn, 
                                                       const Index num_lhs = 1, 
                                                       const Index num_rhs = 1, 
                                                       const Index num_slices = 1) {
    
        d.template deallocate_blocks<LhsScalar, RhsScalar>(handle, bm * bk, bn * bk, num_lhs, num_rhs, num_slices);
      }
    };
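
    For completeness, a hypothetical sketch of the device-side allocate_blocks assumed above (illustrative only; the real implementation is in the commits). The device computes the sizes and aligned offsets, allocates a single handle, fills blocks with the per-block pointers, and runs constructors where required:

    template <typename LhsScalar, typename RhsScalar>
    void* allocate_blocks(size_t lhs_count, size_t rhs_count, std::vector<void*>& blocks,
                          Index num_lhs = 1, Index num_rhs = 1, Index num_slices = 1) const {
      const size_t align = numext::maxi<size_t>(EIGEN_MAX_ALIGN_BYTES, 1);
      const size_t lhs_size = divup(lhs_count * sizeof(LhsScalar), align) * align;  // aligned lhs block
      const size_t rhs_size = divup(rhs_count * sizeof(RhsScalar), align) * align;  // aligned rhs block
      char* mem = static_cast<char*>(
          allocate(num_slices * (num_lhs * lhs_size + num_rhs * rhs_size)));
      char* pos = mem;
      for (Index s = 0; s < num_slices; ++s) {
        for (Index m = 0; m < num_lhs; ++m, pos += lhs_size) {
          if (NumTraits<LhsScalar>::RequireInitialization)
            for (size_t i = 0; i < lhs_count; ++i) new (reinterpret_cast<LhsScalar*>(pos) + i) LhsScalar();
          blocks.push_back(pos);
        }
        for (Index n = 0; n < num_rhs; ++n, pos += rhs_size) {
          if (NumTraits<RhsScalar>::RequireInitialization)
            for (size_t i = 0; i < rhs_count; ++i) new (reinterpret_cast<RhsScalar*>(pos) + i) RhsScalar();
          blocks.push_back(pos);
        }
      }
      return mem;
    }

    deallocate_blocks would walk the same layout, invoking the destructors per element before freeing the single handle.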

    I included both alternatives in the last commit. One can select between them by defining __USING_SINGLE_TYPE_CONTRACTIONS__.

    Let me know if either alternative works. If so, I will remove the other one together with the __USING_SINGLE_TYPE_CONTRACTIONS__ flag.

    P.S.: I have included a new test for the ThreadPool case with a custom allocator.

    Edited by Luiz doleron
  • Hi @cantonios, is there anything I can do to help with this review? Based on your last feedback, I provided two code alternatives. Is either of them good? P.S.: sorry for asking, but this subject is very important to my current work!

    • Resolved by Luiz doleron

      I think we also need the size of the array for the deallocation, so we can call the correct number of destructors. That's why I made a custom device with the pointer and size of each array. A generic fix requires a lot of work.
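
      Illustratively: an explicit destruction pass cannot recover the element count from a raw pointer, so the count has to be carried along (hypothetical helper, not from the MR):

      template <typename T>
      void destroy_elements(T* ptr, size_t count) {
        if (NumTraits<T>::RequireInitialization) {
          for (size_t i = 0; i < count; ++i) ptr[i].~T();  // one destructor per element
        }
      }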

  • Luiz doleron added 1 commit

    • 73946bcd - commit assumes type(LhsScalar) == type(RhsScalar)

  • Luiz doleron added 1 commit

    • eb799e85 - moving initialization/finalizaation code to device

  • Luiz doleron added 1 commit

    • 7f42faa7 - blocking index in deallocating

  • Luiz doleron added 4 commits

    • 26cf340a - added test for contraction on default device
    • dc1436fa - test contraction+requiredinitialization+threadpool
    • be212566 - fix multi-blocks/multi-slices + annoying tests
    • 506ca031 - moving initialization/finalizaation code to device

  • Luiz doleron added 1 commit
