Fixes #2714: contractions fail with SEGV when computing non linear scalars such as autodiff
Reference issue
What does this implement/fix?
This MR is intended to fix a bug in Tensor contractions when using non-linear scalars or scalars that need initialization/finalization. The issue affects contractions running on DefaultDevice and ThreadPoolDevice, causing segmentation faults and memory leaks.
It turns out that the current contraction code uses a raw memory chunk to allocate the operands, correctly deallocating this block after the computation. However, with exotic scalars such as Eigen::AutoDiffScalar, each element is effectively a graph holding pointers to other memory locations, so the elements need to be initialized before and finalized after the contraction computation.
The raw memory approach is still relevant for performance: it provides aligned memory that allows vectorization.
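As a rough standalone illustration (not code from this MR, and using std::malloc only to mimic the raw-chunk approach), the following shows why writing into unconstructed storage is undefined for such scalars and how placement new plus an explicit destructor call avoids it:

#include <unsupported/Eigen/AutoDiff>
#include <cstdlib>
#include <new>

using AD = Eigen::AutoDiffScalar<Eigen::VectorXd>;

int main() {
  void* raw = std::malloc(sizeof(AD));  // raw chunk, standing in for the contraction's block memory
  AD* a = static_cast<AD*>(raw);
  // *a = AD(1.0);    // undefined behavior: operator= uses the unconstructed derivatives vector
  new (raw) AD(1.0);  // placement new runs the constructor inside the pre-allocated storage
  a->~AD();           // explicit destructor call; the raw chunk itself is freed separately
  std::free(raw);
  return 0;
}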
This MR fixes this issue by performing two additional steps:
- initializing the scalars by explicitly calling their constructors
- finalizing the scalars by explicitly invoking their destructors
Note that these two steps must not allocate or deallocate the memory themselves: the device has already allocated the block, so we cannot allocate it again, and we cannot free it ourselves either, since the device will deallocate it later using the correct alignment offset.
As a solution, the fix uses placement new and explicit destructor calls. These two actions are implemented in the TensorContractionKernel struct:
template <typename ResScalar, typename LhsScalar, typename RhsScalar,
          typename StorageIndex, typename OutputMapper, typename LhsMapper,
          typename RhsMapper>
struct TensorContractionKernel {
  // ...

  EIGEN_DEVICE_FUNC void initialize_block(BlockMemHandle block) {
    // ...
  }

  template <typename Device>
  EIGEN_DEVICE_FUNC BlockMemHandle allocate(Device& d, LhsBlock* lhs_block, RhsBlock* rhs_block) {
    BlockMemHandle result = BlockMemAllocator::allocate(d, bm, bk, bn, lhs_block, rhs_block);
    initialize_block(result);
    return result;
  }

  template <typename Device>
  EIGEN_DEVICE_FUNC BlockMemHandle allocateSlices(
      Device& d, const StorageIndex num_lhs, const StorageIndex num_rhs,
      const StorageIndex num_slices, std::vector<LhsBlock>* lhs_blocks,
      std::vector<RhsBlock>* rhs_blocks) {
    BlockMemHandle result = BlockMemAllocator::allocateSlices(
        d, bm, bk, bn, num_lhs, num_rhs, num_slices, lhs_blocks, rhs_blocks);
    initialize_block(result);
    return result;
  }

  EIGEN_DEVICE_FUNC void finalize_block(BlockMemHandle block) {
    // ...
  }

  template <typename Device>
  EIGEN_DEVICE_FUNC void deallocate(Device& d, BlockMemHandle handle) {
    finalize_block(handle);
    BlockMemAllocator::deallocate(d, handle);
  }
};
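The bodies of initialize_block and finalize_block are elided above. The following is only a hedged sketch of the idea, written as free functions: the element counts and the lhs_bytes offset are assumptions about the block layout, not the MR's actual member code.

#include <unsupported/Eigen/CXX11/Tensor>
#include <cstddef>
#include <new>

template <typename LhsScalar, typename RhsScalar, typename StorageIndex>
void initialize_block_sketch(void* block, StorageIndex lhs_count, StorageIndex rhs_count,
                             std::size_t lhs_bytes) {
  char* mem = static_cast<char*>(block);
  if (Eigen::NumTraits<LhsScalar>::RequireInitialization) {
    LhsScalar* lhs = reinterpret_cast<LhsScalar*>(mem);
    for (StorageIndex i = 0; i < lhs_count; ++i) new (lhs + i) LhsScalar();  // construct in place
  }
  if (Eigen::NumTraits<RhsScalar>::RequireInitialization) {
    RhsScalar* rhs = reinterpret_cast<RhsScalar*>(mem + lhs_bytes);  // lhs_bytes: assumed offset of the RHS region
    for (StorageIndex i = 0; i < rhs_count; ++i) new (rhs + i) RhsScalar();
  }
}

template <typename LhsScalar, typename RhsScalar, typename StorageIndex>
void finalize_block_sketch(void* block, StorageIndex lhs_count, StorageIndex rhs_count,
                           std::size_t lhs_bytes) {
  char* mem = static_cast<char*>(block);
  if (Eigen::NumTraits<LhsScalar>::RequireInitialization) {
    LhsScalar* lhs = reinterpret_cast<LhsScalar*>(mem);
    for (StorageIndex i = 0; i < lhs_count; ++i) lhs[i].~LhsScalar();  // destroy, but do not free
  }
  if (Eigen::NumTraits<RhsScalar>::RequireInitialization) {
    RhsScalar* rhs = reinterpret_cast<RhsScalar*>(mem + lhs_bytes);
    for (StorageIndex i = 0; i < rhs_count; ++i) rhs[i].~RhsScalar();
  }
}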
Tests
This MR includes two new tests: test_scalar_initialization (in cxx11_tensor_contraction.cpp) and test_multithread_contraction_with_scalar_initialization (in cxx11_tensor_thread_pool.cpp). Both tests use a simple custom scalar type, InitializableScalar.
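The exact definition of InitializableScalar lives in the test files; purely as a hedged illustration, a scalar of this kind is roughly a type that owns heap memory (so skipping its constructor or destructor crashes or leaks) and that Eigen's generic NumTraits already reports as requiring initialization:

#include <Eigen/Core>

// Illustrative stand-in (not the actual test type): owns heap storage, so it only works
// if its constructors and destructor really run.
struct InitializableScalarSketch {
  InitializableScalarSketch() : value(new float(0.f)) {}
  InitializableScalarSketch(float v) : value(new float(v)) {}
  InitializableScalarSketch(const InitializableScalarSketch& o) : value(new float(*o.value)) {}
  InitializableScalarSketch& operator=(const InitializableScalarSketch& o) { *value = *o.value; return *this; }
  ~InitializableScalarSketch() { delete value; }
  InitializableScalarSketch operator*(const InitializableScalarSketch& o) const { return InitializableScalarSketch(*value * *o.value); }
  InitializableScalarSketch operator+(const InitializableScalarSketch& o) const { return InitializableScalarSketch(*value + *o.value); }
  InitializableScalarSketch& operator+=(const InitializableScalarSketch& o) { *value += *o.value; return *this; }
  float* value;
};

// For a class type like this, Eigen's generic NumTraits already yields
// NumTraits<InitializableScalarSketch>::RequireInitialization == 1, which is the flag the fix checks.
int main() {
  InitializableScalarSketch a(1.f), b(2.f);
  InitializableScalarSketch c = a * b + a;  // 1*2 + 1 == 3
  return 0;
}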
Additional information
- Contractions of plain scalars such as float or double are not affected by the initialization/finalization steps, because the fix checks NumTraits<LhsScalar>::RequireInitialization.
- Both before and with this MR, contractions of exotic scalars do not compile on GPU devices.
Merge request reports
Activity
- Resolved by Luiz doleron
Very cool. Perhaps we should test with this?
https://gitlab.com/libeigen/eigen/-/blob/master/test/AnnoyingScalar.h
added 1 commit
- 7dcedd4c - fix multi-blocks/multi-slices + annoying tests
added 6 commits
- 8f858a4e - 1 commit from branch libeigen:master
- abf3f02f - init & destroy scalars when RequireInitialization
- 0408f0d3 - added test for contraction on default device
- 93077496 - test contraction+requiredinitialization+threadpool
- 2dd9385e - fix multi-blocks/multi-slices + annoying tests
- 93217340 - fix commentary
added 7 commits
- 6b20ef53 - init & destroy scalars when RequireInitialization
- 5fb7889e - added test for contraction on default device
- e6b7d44b - test contraction+requiredinitialization+threadpool
- 7dcedd4c - fix multi-blocks/multi-slices + annoying tests
- 63a3f7fe - fix commentary
- e86ae6fc - init/finalizing scalars on shard threading
- 0c16e596 - Merge branch 'fix-contraction-autodiff' of...
added 1 commit
- 1e5eac48 - Example of Tensor + Autodiff to train MLP network
Added an example of training an MLP network using only Eigen::Tensor and Eigen::AutoDiff:

template <typename T, typename Device>
T loop(const Device &device, const Eigen::Tensor<T, 2> &_TRUE, const Eigen::Tensor<T, 2> &_X,
       Eigen::Tensor<T, 2> &_W0, Eigen::Tensor<T, 2> &_W1, T learning_rate) {
  // converting tensors to autodiff
  Eigen::Tensor<AutoDiff_T, 2> TRUE = convert(_TRUE);
  Eigen::Tensor<AutoDiff_T, 2> X = convert(_X);
  Eigen::Tensor<AutoDiff_T, 2> W0 = convert(_W0, _W0.size() + _W1.size());
  Eigen::Tensor<AutoDiff_T, 2> W1 = convert(_W1, _W0.size() + _W1.size(), W0.size());

  // forward pass
  const Eigen::array<Eigen::IndexPair<int>, 1> contract_dims = {Eigen::IndexPair<int>(1, 0)};

  // Hidden layer
  Eigen::Tensor<AutoDiff_T, 2> Z0(X.dimension(0), W0.dimension(1));
  Z0.device(device) = X.contract(W0, contract_dims);
  Eigen::Tensor<AutoDiff_T, 2> Y0 = ReLU(Z0);

  // Output Layer
  Eigen::Tensor<AutoDiff_T, 2> Z1(Y0.dimension(0), W1.dimension(1));
  Z1 = Y0.contract(W1, contract_dims);
  auto Y1 = softmax(Z1);

  AutoDiff_T LOSS = categorical_cross_entropy(TRUE, Y1);

  // backward pass
  auto gradients = unpack_gradients(LOSS, W0, W1);
  auto grad0 = std::get<0>(gradients);
  auto grad1 = std::get<1>(gradients);

  // update pass
  _W0 = _W0 - grad0 * grad0.constant(learning_rate);
  _W1 = _W1 - grad1 * grad1.constant(learning_rate);

  T result = LOSS.value();
  return result;
}
Running the big cross compilation test https://gitlab.com/libeigen/eigen_ci_cross_testing/-/pipelines/1001885726 to see if there are any annoying compiler warnings or regressions. The fixes appear to be sensible and the code is easy to understand. @cantonios owns the tensor portions of Eigen, so I'll defer to his review.
Diff context:

  template <typename Device>
  EIGEN_DEVICE_FUNC BlockMemHandle allocate(Device& d, LhsBlock* lhs_block,
                                            RhsBlock* rhs_block) {
-   return BlockMemAllocator::allocate(d, bm, bk, bn, lhs_block, rhs_block);
+   BlockMemHandle result = BlockMemAllocator::allocate(d, bm, bk, bn, lhs_block, rhs_block);
+   initialize_block(result);

Hi @cantonios, thank you for the feedback. This MR was intended to fix contractions on default and TP devices. In my tests, contractions of AutoDiffScalar on GPU devices don't compile. Maybe fixing these contractions on GPU requires more sophisticated work.
For the case of TP/default devices, the suggestion here was to run the initialization/finalization in TensorContractionKernel because this looks like a good balance between type specialization and cohesion. It turns out that making the device initialize/finalize by itself would require the device to know the operand layout. One way to achieve that without introducing high coupling between the device type and the operation type is to pass the operand layout as an optional parameter to device.allocate() or device.deallocate().

If I'm not missing something, my suggestion would be to initialize/finalize the scalars in TensorContractionKernel (or perhaps in TensorContractionBlockMemAllocator?). Once the contractions are running on TP/default devices, we can move on to tackle the issues on GPU devices. What do you think?

We can't just hack it so that CPU contractions work. The system still needs to be designed consistently. Otherwise we'll need to redesign the whole thing in order to get the GPU (or other custom devices like SYCL) to work.
The purpose of having a Device is to handle these device-specific differences. The device itself shouldn't need to know the layout - only the number of elements and the type. We probably need something like

template <typename T> T* allocate(size_t count);
template <typename T> void deallocate(T* ptr);

that on CPU (including the thread pool) will just call new/delete if the type requires initialization - something similar to what we already do in Memory.h. You would then need to modify BlockMemAllocator to use this - that's the thing that controls the layout.

changed this line in version 8 of the diff
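As a rough sketch of that suggestion (the names and signatures here are illustrative, not the interface that was merged), the conditional construction/destruction already exists in Eigen's Memory.h helpers, so a CPU device could simply forward to them:

#include <Eigen/Core>
#include <cstddef>

struct CpuDeviceSketch {
  template <typename T>
  T* allocate(std::size_t count) const {
    // allocates aligned memory and default-constructs the elements only when
    // NumTraits<T>::RequireInitialization is set
    return Eigen::internal::conditional_aligned_new_auto<T, true>(count);
  }
  template <typename T>
  void deallocate(T* ptr, std::size_t count) const {
    // note: the element count is needed here so the right number of destructors run;
    // this point comes up again later in the thread
    Eigen::internal::conditional_aligned_delete_auto<T, true>(ptr, count);
  }
};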
Hi, as asked, I implemented the initialization of scalars on the device:

struct DefaultDevice {
  // ...
  template <typename T>
  void* allocate_elements(size_t num_elements) const {
    const size_t element_size = sizeof(T);
    size_t num_bytes = num_elements * element_size;       // calculating size
    size_t align = numext::maxi(EIGEN_MAX_ALIGN_BYTES, 1);
    num_bytes = divup<Index>(num_bytes, align) * align;    // aligning
    void* result = allocate(num_bytes);                    // allocating
    if (NumTraits<T>::RequireInitialization) {             // initializing when necessary
      char* mem_pos = reinterpret_cast<char*>(result);
      for (size_t i = 0; i < num_elements; ++i) {
        new (mem_pos) T();
        mem_pos += element_size;
      }
    }
    return result;
  }
  // ...
};
This code also passes the tests:
struct TensorContractionBlockMemAllocator {
  template <typename Device>
  static BlockMemHandle allocate(...) {
    BlockSizes sz = ComputeLhsRhsBlockSizes(bm, bk, bn);  // calculates alignment, block and memory sizes
    char* block_mem = static_cast<char*>(
        d.template allocate_elements<LhsScalar>(sz.lhs_count + sz.rhs_count));  // allocates memory using only LhsScalar as the type
    *lhs_block = static_cast<LhsScalar*>(static_cast<void*>(block_mem));
    *rhs_block = static_cast<RhsScalar*>(static_cast<void*>(block_mem + sz.lhs_size));  // sz.lhs_size is computed by the host
    return block_mem;
  }
  // ...
};
Note that:
- Since we are initializing all the data as a single type, the memory is initialized using only LhsScalar.
- On the host side, ComputeLhsRhsBlockSizes computes the memory size and alignment, which are device-dependent concerns in my understanding.
Because of the previous two issues, I implemented a secondary approach:
struct TensorContractionBlockMemAllocator {
  template <typename Device>
  EIGEN_DEVICE_FUNC static BlockMemHandle allocate(Device& d, const Index bm, const Index bk,
                                                   const Index bn, LhsScalar** lhs_block,
                                                   RhsScalar** rhs_block) {
    std::vector<void*> blocks;
    Index left_block_count = bm * bk;
    Index right_block_count = bn * bk;
    char* block_mem = static_cast<char*>(d.template allocate_blocks<LhsScalar, RhsScalar>(
        left_block_count, right_block_count, blocks));
    *lhs_block = static_cast<LhsScalar*>(blocks[0]);
    *rhs_block = static_cast<RhsScalar*>(blocks[1]);
    return block_mem;
  }
};
This last version solves the two previous issues:
- It uses both LhsScalar and RhsScalar to allocate/initialize the data.
- The device does all the calculations of memory size and alignment (a hedged device-side sketch follows below).
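Purely as an illustration of this division of responsibilities (this is not the MR's code; the helper name and the offset math are assumptions), a device-side allocate_blocks for the simple two-block case could look like:

#include <unsupported/Eigen/CXX11/Tensor>
#include <cstddef>
#include <new>
#include <vector>

template <typename LhsScalar, typename RhsScalar>
void* allocate_blocks_sketch(const Eigen::DefaultDevice& d, std::size_t lhs_count,
                             std::size_t rhs_count, std::vector<void*>& blocks) {
  const std::size_t align = Eigen::numext::maxi<std::size_t>(EIGEN_MAX_ALIGN_BYTES, 1);
  // the device, not the caller, rounds each typed region up to the alignment boundary
  const std::size_t lhs_bytes = Eigen::divup(lhs_count * sizeof(LhsScalar), align) * align;
  const std::size_t rhs_bytes = Eigen::divup(rhs_count * sizeof(RhsScalar), align) * align;
  char* mem = static_cast<char*>(d.allocate(lhs_bytes + rhs_bytes));

  LhsScalar* lhs = reinterpret_cast<LhsScalar*>(mem);
  RhsScalar* rhs = reinterpret_cast<RhsScalar*>(mem + lhs_bytes);
  if (Eigen::NumTraits<LhsScalar>::RequireInitialization)
    for (std::size_t i = 0; i < lhs_count; ++i) new (lhs + i) LhsScalar();
  if (Eigen::NumTraits<RhsScalar>::RequireInitialization)
    for (std::size_t i = 0; i < rhs_count; ++i) new (rhs + i) RhsScalar();

  blocks.push_back(lhs);  // the caller only receives ready-to-use typed pointers
  blocks.push_back(rhs);
  return mem;
}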
We have a slightly more complex call for the ThreadPool use case. In the multithreaded contraction, a single memory block is sharded into several parts so that multiple threads can compute their jobs without blocking each other. The following call covers this scenario:
template <typename LhsScalar, typename RhsScalar>
struct TensorContractionBlockMemAllocator {
  // ...
  template <typename Device>
  static BlockMemHandle allocateSlices(Device& d, const Index bm, const Index bk, const Index bn,
                                       const Index num_lhs, const Index num_rhs,
                                       const Index num_slices,
                                       std::vector<LhsScalar*>* lhs_blocks,
                                       std::vector<RhsScalar*>* rhs_blocks) {
    std::vector<void*> blocks;
    char* block_mem = static_cast<char*>(d.template allocate_blocks<LhsScalar, RhsScalar>(
        bm * bk, bn * bk, blocks, num_lhs, num_rhs, num_slices));
    Index blocks_index = 0;
    for (Index slice = 0; slice < num_slices; slice++) {
      if (num_lhs > 0) lhs_blocks[slice].resize(num_lhs);
      for (Index m = 0; m < num_lhs; m++) {
        void* block_memory = blocks[blocks_index++];
        lhs_blocks[slice][m] = static_cast<LhsScalar*>(block_memory);
      }
      if (num_rhs > 0) rhs_blocks[slice].resize(num_rhs);
      for (Index n = 0; n < num_rhs; n++) {
        void* block_memory = blocks[blocks_index++];
        rhs_blocks[slice][n] = static_cast<RhsScalar*>(block_memory);
      }
    }
    return block_mem;
  }
};
Note again that the memory positions in blocks are computed by the device. The deallocation side uses the same ideas:
~EvalParallelContext() {
  kernel_.deallocate(device_, packed_mem_, nm0_, nn0_, std::min<Index>(nk_, P - 1));
  if (parallelize_by_sharding_dim_only_) {
    const int num_worker_threads = device_.numThreadsInPool();
    if (shard_by_col_) {
      Index num_blocks = num_worker_threads * gn_;
      kernel_.deallocate(device_, thread_local_pre_alocated_mem_, 0, num_blocks, 1);
    } else {
      Index num_blocks = num_worker_threads * gm_;
      kernel_.deallocate(device_, thread_local_pre_alocated_mem_, num_blocks, 0, 1);
    }
  }
}
and
template <typename LhsScalar, typename RhsScalar>
struct TensorContractionBlockMemAllocator {
  // ...
  template <typename Device>
  EIGEN_DEVICE_FUNC static void deallocate(Device& d, BlockMemHandle handle, const Index bm,
                                           const Index bk, const Index bn,
                                           const Index num_lhs = 1, const Index num_rhs = 1,
                                           const Index num_slices = 1) {
    d.template deallocate_blocks<LhsScalar, RhsScalar>(handle, bm * bk, bn * bk, num_lhs, num_rhs,
                                                       num_slices);
  }
};
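For symmetry, an equally hypothetical sketch of a matching device-side deallocate_blocks (again assuming the simple two-block layout and the same offset math as the allocation sketch above): run the destructors for each typed region when required, then free the single underlying allocation.

#include <unsupported/Eigen/CXX11/Tensor>
#include <cstddef>

template <typename LhsScalar, typename RhsScalar>
void deallocate_blocks_sketch(const Eigen::DefaultDevice& d, void* handle,
                              std::size_t lhs_count, std::size_t rhs_count) {
  const std::size_t align = Eigen::numext::maxi<std::size_t>(EIGEN_MAX_ALIGN_BYTES, 1);
  const std::size_t lhs_bytes = Eigen::divup(lhs_count * sizeof(LhsScalar), align) * align;
  char* mem = static_cast<char*>(handle);
  if (Eigen::NumTraits<LhsScalar>::RequireInitialization) {
    LhsScalar* lhs = reinterpret_cast<LhsScalar*>(mem);
    for (std::size_t i = 0; i < lhs_count; ++i) lhs[i].~LhsScalar();  // destroy, do not free
  }
  if (Eigen::NumTraits<RhsScalar>::RequireInitialization) {
    RhsScalar* rhs = reinterpret_cast<RhsScalar*>(mem + lhs_bytes);
    for (std::size_t i = 0; i < rhs_count; ++i) rhs[i].~RhsScalar();
  }
  d.deallocate(handle);  // the device frees the whole block with the correct alignment offset
}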
I included the two alternatives in the last commit. One can control which one to use by defining __USING_SINGLE_TYPE_CONTRACTIONS__. Let me know if one of the alternatives works; if so, I will remove the other alternative and the __USING_SINGLE_TYPE_CONTRACTIONS__ flag.

PS: I have included a new test for the case of ThreadPool using a custom allocator.
Edited by Luiz doleron
Hi @cantonios, is there anything I can do to help move this review forward? Based on your last feedback, I provided two code alternatives. Is either of them good? PS: sorry for asking, but this subject is very important to my current work!
- Resolved by Luiz doleron
I think we also need the size of the array for the deallocation so we can call the correct number of destructors. That's why I made a custom device with the ptr and sizes of each array. A generic fix requires a lot of work.
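A hedged sketch (not code from this MR, and ignoring over-alignment concerns for brevity) of the idea mentioned above - a custom allocator that records the element count per allocation so it can run the right number of destructors on deallocation:

#include <cstddef>
#include <new>
#include <unordered_map>

template <typename T>
class CountingAllocatorSketch {
 public:
  T* allocate(std::size_t count) {
    void* mem = ::operator new(count * sizeof(T));
    T* ptr = static_cast<T*>(mem);
    for (std::size_t i = 0; i < count; ++i) new (ptr + i) T();  // construct each element
    counts_[mem] = count;                                       // remember the array size
    return ptr;
  }
  void deallocate(T* ptr) {
    const std::size_t count = counts_.at(ptr);
    for (std::size_t i = 0; i < count; ++i) ptr[i].~T();        // run the right number of destructors
    counts_.erase(ptr);
    ::operator delete(ptr);
  }
 private:
  std::unordered_map<void*, std::size_t> counts_;  // ptr -> number of constructed elements
};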
added 1 commit
- 73946bcd - commit assumes type(LhsScalar) == type(RhsScalar)
added 1 commit
- eb799e85 - moving initialization/finalizaation code to device