# Why is this tensor contraction so slow and how can I make it faster?


## Problem

I am trying to compute a specific, fairly involved tensor contraction (of complex-valued tensors) that appears as part of a larger calculation and is called many times in a row, about 100 times and possibly more later. Performance is therefore key.

For a 1D tensor (array) `A(N)`, a 3D tensor `B(N,N,N)`, and a 4D tensor `C(2,N,N,N)`, where the parentheses denote the shapes of these tensors, the contraction is given by

```
out[0, i, j, k] = B[i, j, k] + sum_l A[l] *      B[i, j, l]  * C[0, i, l, k]
out[1, i, j, k] =              sum_l A[l] * conj(B[i, l, j]) * C[1, i, l, k]
```

where `conj(z)` denotes the complex conjugate and `out` is an output tensor of the same shape as `C`. In principle these are of course two separate contractions, one for each value of the first index of `out` and `C`.
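For checking the faster implementations against something unambiguous, the contraction can be written directly with `numpy.einsum` (a plain reference, not tuned for speed; the function name is my own):

```python
import numpy

def contract_ref(A, B, C):
    """Reference implementation of the contraction via numpy.einsum."""
    out = numpy.empty_like(C)
    # out[0,i,j,k] = B[i,j,k] + sum_l A[l] * B[i,j,l] * C[0,i,l,k]
    out[0] = B + numpy.einsum('l,ijl,ilk->ijk', A, B, C[0])
    # out[1,i,j,k] = sum_l A[l] * conj(B[i,l,j]) * C[1,i,l,k]
    out[1] = numpy.einsum('l,ilj,ilk->ijk', A, numpy.conj(B), C[1])
    return out
```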

Currently I have two implementations of this contraction, one in C++ and one in Python.
The C++ implementation uses arrays of arrays (pointer-to-pointer structures) for the tensors:

```cpp
template <typename T>
void contract(const int N, T *A, T ***B, T ****C, T ****out) {
#pragma omp parallel for collapse(3)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
            {
                T xx = B[i][j][k];
                T yy = 0.0;

                for (int ind = 0; ind < N; ind++)
                {
                    xx += A[ind] * B[i][j][ind] * C[0][i][ind][k];
                    yy += A[ind] * conj(B[i][ind][j]) * C[1][i][ind][k];
                }

                out[0][i][j][k] = xx;
                out[1][i][j][k] = yy;
            }
}
```

The Python version uses `numpy` and `numba`:

```python
import numba
import numpy

@numba.njit(parallel=True)
def contract(A, B, C, out):
    N = B.shape[0]
    for i, j, k in numba.pndindex(B.shape):
        xx = B[i, j, k]
        yy = 0.0j
        for l in range(N):
            xx += A[l] * B[i, j, l] * C[0, i, l, k]
            yy += A[l] * numpy.conj(B[i, l, j]) * C[1, i, l, k]
        out[0, i, j, k] = xx
        out[1, i, j, k] = yy
```

To compare the different versions, I suggest the following Python driver (the C++ test program below is analogous):

```python
import numpy
import time

N = 200
A = numpy.ones(N, dtype=complex)
B = numpy.ones((N, N, N), dtype=complex) * (1.5 + 2.5j)
C = numpy.ones((2, N, N, N), dtype=complex)
C[0] *= 2
C[1] *= 3.5
out = numpy.empty((2, N, N, N), dtype=complex)

# Warm-up call to avoid measuring numba's JIT compilation
contract(A, B, C, out)

start_time = time.perf_counter()  # Begin time measurement
contract(A, B, C, out)
end_time = time.perf_counter()    # End time measurement
run_time = end_time - start_time
print(run_time)

@numba.njit
def calc_res(out):
    # This is just some testing function
    res = 0
    for i in range(N):
        for j in range(N):
            for k in range(N):
                res += (out[0, i, j, k] - out[1, i, j, k]) * (-1.0 * numpy.sign(1 + numpy.real(res)))
    return res

print(calc_res(out))
```

This should print `3.588e9-2.202e10j` for `N = 200`. Ultimately, the goal is to compute this contraction quickly for `N` up to 600.

Currently, the best times I found are ~0.77 s for the Python version and ~1.055 s for the C++ version. It was already surprising that the Python/numba version is faster than the C++ one; this is probably because the C++ arrays are pointers-to-pointers-to…, which adds an indirection on every element access.

Of course, when benchmarking, only the call(s) to the contraction function should be measured. Beyond that, you are free to modify the code in any way, shape, or form to get better results; I have a lot of freedom when it comes to the actual code. Here are several thoughts I had:

1. It might be possible to use my graphics card via numba CUDA, but I don't have much experience with that. I wrote a short test version, but it causes a MemoryError for N = 600 because I tried to copy all the arrays into GPU memory at once. Is there a way to remedy this and use CUDA (preferably from Python rather than directly from C++)?

2. Next, I thought that permuting the axes of the arrays might improve performance if it lets SIMD instructions access neighbouring elements, but that did not seem to change anything.

3. One important step was parallelization, as you can already see in the code: OpenMP in C++ and numba's parallel loops in Python.

4. It is also possible to split the output into its real and imaginary parts. In my tests that did not change much either, although I expected the simpler memory layout to help SIMD.
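Two of these thoughts can be combined into one sketch (function name and chunk size are my own, not from the original code): the `i` axis is fully independent, so it can be processed in blocks, which is also how one would avoid the CUDA MemoryError (transfer one block at a time); and for each fixed `i`, the sum over `l` is just a matrix product, so numpy's batched `@` (which dispatches to BLAS) can evaluate a whole block at once:

```python
import numpy

def contract_chunked(A, B, C, out, chunk=64):
    # Process independent blocks of the i axis.  For each fixed i the sum
    # over l is a matrix product, so each block is one batched matmul.
    # On a GPU, the same slicing would bound how much data sits in device
    # memory at any one time.
    N = B.shape[0]
    for s in range(0, N, chunk):
        sl = slice(s, min(s + chunk, N))
        Bc = B[sl]                                 # shape (c, N, N)
        # out[0,i,j,k] = B[i,j,k] + sum_l (A[l]*B[i,j,l]) * C[0,i,l,k]
        out[0, sl] = Bc + (Bc * A) @ C[0, sl]
        # out[1,i,j,k] = sum_l (A[l]*conj(B[i,l,j])) * C[1,i,l,k]
        Dc = numpy.conj(Bc).transpose(0, 2, 1) * A
        out[1, sl] = Dc @ C[1, sl]
```

Whether this beats the explicit loops depends on the BLAS in use; it is meant as a structural sketch, not a definitive implementation.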

You may make any appropriate change to the `contract` function, or more generally to the code between the two measurement lines.

Generally, I was quite surprised to find that these kinds of contractions are so slow; surely there must be some way to speed this computation up further. There seem to be many tools available in areas such as machine learning, which I also came across while searching for numba/CUDA.

The C++ test program roughly looks as follows. I have omitted the implementations of a few helper functions for brevity; they are not relevant for speed and are not called in the time-critical region.

```cpp
#include <iostream>
#include <complex>
#include <chrono>

typedef std::complex<double> CX;

const CX I(0, 1);

const int N = 200;

int main() {

    CX *A = new CX[N];
    CX ***B = make_array_3d<CX>(N);

    CX ****C = new CX***[2];
    C[0] = make_array_3d<CX>(N);
    C[1] = make_array_3d<CX>(N);

    CX ****out = new CX***[2];
    out[0] = make_array_3d<CX>(N);
    out[1] = make_array_3d<CX>(N);

    fill_1d<CX>(N, A, 1.0);
    fill_3d<CX>(N, B, 1.5 + 2.5 * I);
    fill_3d<CX>(N, C[0], 2.0);
    fill_3d<CX>(N, C[1], 3.5);

    auto begin = std::chrono::steady_clock::now();
    contract<CX>(N, A, B, C, out);
    auto end = std::chrono::steady_clock::now();

    // Output in microseconds
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << std::endl;

    // Calculate the result for testing purposes
    CX res = 0.0;
    for (int n = 0; n < N; n++)
        for (int si = 0; si < N; si++)
            for (int t = 0; t < N; t++)
                res += (out[0][n][si][t] - out[1][n][si][t]) * (-1.0 * sgn(1 + real(res)));

    std::cout << res << std::endl;

    return 0;
}
```

The C++ compiler invocation is `g++ test.cpp -O3 -fopenmp`. I also tried `-Ofast`, but that gains at most 0.1 s.

## Solution

Let’s look at just `out[0, i, j, k]` for now. The inner loop looks like `u += A[l]*B[i, j, l]*C[0, i, l, k]`.

What is its memory access pattern? `u` is local and all of `A` fits in the cache, so ignore them. The loop does two memory loads (one from `B`, one from `C`) per iteration. Suppose `N = 600`. A `complex<double>` is 16 bytes, so `B` and the `C[0]` slice together occupy `16 bytes/element * 2 * 600**3 ≈ 6.9 GB` of memory. That is definitely not cacheable; a typical cache is measured in megabytes, not gigabytes.

Now, for each iteration, how much actual computation is happening? I count two complex multiplications and one complex addition, roughly 14 real floating-point operations, set against the 32 bytes loaded from memory.

This means that this computation is going to be RAM-bound, not CPU-bound. To make it go faster, you should use the fastest RAM you have available. The memory on a graphics card moves at roughly 10-100x the bandwidth of host RAM, so I highly recommend running this on a GPU: not because a GPU has a higher processing speed (although it does), but because it has far higher memory bandwidth.
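The back-of-envelope numbers can be made concrete (a sketch; the byte counts assume `complex128` elements and no cache reuse in the naive loop):

```python
# Arithmetic intensity of the naive inner loop, assuming complex128
# (16 bytes/element) and no cache reuse: each iteration loads one
# element of B and one of C.
N = 600
bytes_per_iter = 2 * 16            # one load from B, one from C
flops_per_iter = 2 * 6 + 2         # two complex multiplies (~6 flops each) + one complex add
intensity = flops_per_iter / bytes_per_iter
print(intensity)                   # ~0.44 flop/byte -- far too low to keep a CPU busy

# Combined footprint of B and C[0] for the out[0] half of the contraction
footprint_gb = 16 * 2 * N**3 / 1e9
print(footprint_gb)                # ~6.9 GB -- not remotely cacheable
```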

The other half of this calculation is similar.