Problem
I am trying to calculate a specific, complicated tensor contraction (of complex-valued tensors) that appears as part of a larger calculation and is called many times in a row, about 100 times now and possibly more later. Performance is therefore key.
For a 1D tensor (array) A(N), a 3D tensor B(N,N,N), and a 4D tensor C(2,N,N,N), where the parentheses show the shapes of these tensors, this contraction is given by
out[0, i, j, k] = sum_l A[l] * B[i, j, l] * C[0, i, l, k] + B[i, j, k]
out[1, i, j, k] = sum_l A[l] * conj(B[i, l, j]) * C[1, i, l, k]
where conj(z) refers to complex conjugation and out is the output tensor, with the same shape as C. In principle, these are of course two tensor contractions, one for each value of the first index of out and C.
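For reference, and to validate faster variants on small N, the same contraction can be transcribed directly with numpy.einsum (the function name here is mine, not part of the code below):

```python
import numpy

def contract_reference(A, B, C):
    """Direct numpy.einsum transcription of the two contractions (for validation)."""
    out = numpy.empty_like(C)
    # out[0, i, j, k] = sum_l A[l] * B[i, j, l] * C[0, i, l, k] + B[i, j, k]
    out[0] = numpy.einsum("l,ijl,ilk->ijk", A, B, C[0]) + B
    # out[1, i, j, k] = sum_l A[l] * conj(B[i, l, j]) * C[1, i, l, k]
    out[1] = numpy.einsum("l,ilj,ilk->ijk", A, numpy.conj(B), C[1])
    return out
```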
Currently, I have two implementations of this contraction, one in C++ and one in Python.
The C++ implementation uses arrays-of-arrays for the tensors:
template <typename T>
void contract(const int N, T *A, T ***B, T ****C, T ****out) {
#pragma omp parallel for collapse(3)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
            {
                T xx = B[i][j][k];
                T yy = 0.0;
                for (int ind = 0; ind < N; ind++)
                {
                    xx += A[ind] * B[i][j][ind] * C[0][i][ind][k];
                    yy += A[ind] * conj(B[i][ind][j]) * C[1][i][ind][k];
                }
                out[0][i][j][k] = xx;
                out[1][i][j][k] = yy;
            }
}
The Python version utilizes numpy and numba:
import numba
import numpy

@numba.njit(parallel=True)
def contract(A, B, C, out):
    N = B.shape[0]
    for i, j, k in numba.pndindex(B.shape):
        xx = B[i, j, k]
        yy = 0.0j
        for l in range(N):
            xx += A[l] * B[i, j, l] * C[0, i, l, k]
            yy += A[l] * numpy.conj(B[i, l, j]) * C[1, i, l, k]
        out[0, i, j, k] = xx
        out[1, i, j, k] = yy
I suggest the following Python code to compare the different versions in Python, and the analogous version in C++:
import numpy
import time
N = 200
A = numpy.ones(N, dtype=complex)
B = numpy.ones((N, N, N), dtype=complex) * (1.5 + 2.5j)
C = numpy.ones((2, N, N, N), dtype=complex)
C[0] *= 2
C[1] *= 3.5
out = numpy.empty((2, N, N, N), dtype=complex)
# Avoid delay due to compilation
contract(A, B, C, out)
start_time = time.perf_counter() # Begin time measurement
contract(A, B, C, out)
end_time = time.perf_counter() # End time measurement
run_time = end_time - start_time
print(run_time)
@numba.njit
def calc_res(out):
    # This is just some testing function
    res = 0
    for i in range(N):
        for j in range(N):
            for k in range(N):
                res += (out[0, i, j, k] - out[1, i, j, k]) * (1.0 * numpy.sign(1 + numpy.real(res)))
    return res

print(calc_res(out))
This should print 3.588e9-2.202e10j for N=200. Eventually, this code should calculate the contraction quickly for up to N=600.
Currently, the best times I found were ~0.77 s for the Python version and ~1.055 s for the C++ version. It was already surprising that the Python/numba version is faster than the C++ one. This is probably due to the fact that the C++ arrays are pointers-to-pointers-to-…
Of course, when benchmarking, only the call(s) to the contraction function should be measured. However, you are free to modify this in any way, shape or form you want to obtain better results. I have a lot of freedom when it comes to the actual code. Here are several thoughts I had:

It might be possible to utilize the graphics card I have using numba CUDA, but I don't have much experience with that. I had written a short test version, but it causes a memory error for N = 600 because I tried to copy all the arrays into the graphics card's memory at once. Is there maybe a way to remedy this and use CUDA (preferably from Python, not directly from C++)?
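I cannot test CUDA here, but the usual remedy for that kind of memory error is to process the leading i axis in slabs, so that only one slab of B and C needs to be resident on the device at a time (each i is independent in both contractions). The slicing pattern itself can be demonstrated with plain numpy; `contract_in_slabs` and the slab size are hypothetical names of mine, and the same slicing would carry over to a CUDA/CuPy version:

```python
import numpy

def contract_in_slabs(A, B, C, out, slab=64):
    """Process the i axis in slabs of size `slab`.

    Each slab is independent, so on a GPU only one slab of B and C
    would need to be transferred and resident at a time.
    """
    N = B.shape[0]
    for start in range(0, N, slab):
        s = slice(start, min(start + slab, N))
        # Same contraction as before, restricted to i in [start, start + slab)
        out[0, s] = numpy.einsum("l,ijl,ilk->ijk", A, B[s], C[0, s]) + B[s]
        out[1, s] = numpy.einsum("l,ilj,ilk->ijk", A, numpy.conj(B[s]), C[1, s])
```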

Next, I thought that maybe rearranging the axes of the arrays could improve performance, since SIMD instructions can access nearby array elements faster, but that did not seem to change anything.
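One concrete way to run that axis experiment (the function name and the explicit copy are my own) is to copy C so that the summed index l lands on the last, contiguous axis, making the inner loop unit-stride; whether this pays off depends on whether the copy cost is amortized over repeated calls:

```python
import numpy

def contract_l_contiguous(A, B, C):
    """Variant with l moved to the last (unit-stride) axis of C before contracting."""
    # Ct[s, i, k, l] = C[s, i, l, k]; ascontiguousarray materializes the new layout
    Ct = numpy.ascontiguousarray(C.transpose(0, 1, 3, 2))
    out = numpy.empty_like(C)
    out[0] = numpy.einsum("l,ijl,ikl->ijk", A, B, Ct[0]) + B
    out[1] = numpy.einsum("l,ilj,ikl->ijk", A, numpy.conj(B), Ct[1])
    return out
```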

One important step was using parallelization as you can already see from the code. I am using OMP in C++ and numba’s parallelization in the python version.

It is also possible to split this into the real and the imaginary output parts. In my tests that did not change much, although I thought it might help SIMD instructions because of the better memory layout.
You might make any appropriate change to the contract function or, in general, to the code between the two measurement lines.
Generally, I was quite surprised to find that these types of contractions are so slow. Surely there must be some way to speed this computation up further. There seem to be a lot of tools available in areas such as machine learning, which I also discovered while searching for numba/CUDA.
Additional info for C++
The C++ code I am using for testing roughly looks as follows. I have omitted the actual implementation for a few functions for brevity, but these functions are probably not relevant for the speed and are not called in the time critical region.
#include <iostream>
#include <complex>
#include <chrono>
typedef std::complex<double> CX;
const CX I(0, 1);
const int N = 200;
int main() {
    CX *A = new CX[N];
    CX ***B = make_array_3d<CX>(N);
    CX ****C = new CX***[2];
    C[0] = make_array_3d<CX>(N);
    C[1] = make_array_3d<CX>(N);
    CX ****out = new CX***[2];
    out[0] = make_array_3d<CX>(N);
    out[1] = make_array_3d<CX>(N);

    fill_1d<CX>(N, A, 1.0);
    fill_3d<CX>(N, B, 1.5 + 2.5 * I);
    fill_3d<CX>(N, C[0], 2.0);
    fill_3d<CX>(N, C[1], 3.5);

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    contract<CX>(N, A, B, C, out);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    // Output in microseconds
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << std::endl;

    // Calculate the result for testing purposes
    CX res = 0.0;
    for (int n = 0; n < N; n++)
        for (int si = 0; si < N; si++)
            for (int t = 0; t < N; t++)
                res += (out[0][n][si][t] - out[1][n][si][t]) * (1.0 * sgn(1 + real(res)));
    std::cout << res << std::endl;
    return 0;
}
The compiler invocation for C++ is g++ test.cpp -O3 -fopenmp. I also checked -Ofast, but that changes the time by at most 0.1 s.
Solution
Let’s look at just out[0, i, j, k] for now. The inner loop looks like u += A[l] * B[i, j, l] * C[0, i, l, k].
What is its memory access pattern? u is local and all of A will fit in the cache, so ignore them. The loop performs two memory loads (one from B, one from C) per iteration. Suppose N = 600. Each element is a 16-byte complex double, so B and C[0] together hold 16 bytes * 2*600**3 ≈ 6.9 GB of memory. That is definitely not cacheable: a typical cache is measured in megabytes, not gigabytes.
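Spelling that size estimate out (each element is a complex double, i.e. two 8-byte floats):

```python
# Back-of-envelope memory footprint for the inner loop's working set at N = 600.
N = 600
bytes_per_element = 16        # complex double = 2 * 8 bytes
elements_touched = 2 * N**3   # all of B plus the C[0] half of C
total_bytes = bytes_per_element * elements_touched
print(total_bytes / 1e9)      # ≈ 6.9 GB, far beyond any CPU cache
```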
Now, for each iteration, how much actual computation is happening? I count two multiplications and one addition; even counted as real flops on complex values, that is only on the order of a dozen flops per 32 bytes loaded.
This means that this computation is going to be RAM-bound, not CPU-bound. To make it go faster, use the fastest RAM available to you. The RAM on a graphics card moves at 10x-100x the speed of the host RAM. So I highly recommend running this on a GPU: not because a GPU has a higher processing speed (which it does), but because it has a higher memory bandwidth.
The other half of this calculation is similar.
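As an aside, the same memory-bound analysis suggests a CPU-side reformulation: for each fixed i, the sum over l is a plain matrix product, so a batched matmul (dispatched to BLAS by numpy's @ operator) reuses each loaded element about N times instead of streaming it from RAM on every iteration. A sketch, with a hypothetical function name:

```python
import numpy

def contract_matmul(A, B, C):
    """Recast the sum over l as batched matrix products over the i axis."""
    out = numpy.empty_like(C)
    # out[0, i] = (B[i] * A) @ C[0, i] + B[i]   (A scales the l-columns of B[i])
    out[0] = (B * A) @ C[0] + B
    # out[1, i, j, k] = sum_l conj(B[i, l, j]) * A[l] * C[1, i, l, k],
    # i.e. out[1, i] = (conj(B[i]).T * A) @ C[1, i]
    out[1] = (numpy.conj(B).transpose(0, 2, 1) * A) @ C[1]
    return out
```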