Problem

I have gene expression data that I represent as a list of genes and a list of lists of values. I average the expression data for any genes with the same name.

For example:

```
genes = ['A', 'C', 'C', 'B', 'A']
vals = [[2.0, 2.0, 9.0, 9.0], # A: will be averaged with row=4
[3.0, 3.0, 3.0, 3.0], # C: will be averaged with row=2
[8.0, 8.0, 2.0, 2.0], # C: will be averaged with row=1
[4.0, 4.0, 4.0, 3.0], # B: is fine
[1.0, 1.0, 1.0, 1.0]] # A: will be averaged with row=0
```

is converted to

```
genes = ['A', 'B', 'C']
vals = [[1.5, 1.5, 5.0, 5.0],
[4.0, 4.0, 4.0, 3.0],
[5.5, 5.5, 2.5, 2.5]]
```

Here is my function:

```
def avg_dups(genes, values):
"""Finds duplicate genes and averages their expression data.
"""
unq_genes = np.unique(genes)
out_values = np.zeros((unq_genes.shape[0], values.shape[1]))
for i, gene in enumerate(unq_genes):
dups = values[genes==gene]
out_values[i] = np.mean(dups, axis=0)
return (unq_genes, out_values)
```

This function is slower than any other part of my data pipeline, taking 5-10 seconds when other steps that also operate on the whole dataset take sub-seconds. Any thoughts on how I can improve this?

Solution

This seems to be the fastest so far:

```
import numpy
from numpy import newaxis
def avg_dups(genes, values):
folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)
output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
output /= counts[:, newaxis]
return folded, output
```

This finds the unique genes to fold the values into, along with the `current index → new index`

mapping and the number of repeated values that map to the same index:

```
folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)
```

It adds the row from each current index to the new index in the new `output`

:

```
output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
```

`numpy.add.at(output, indices, values)`

is used over `output[indices] += values`

because the buffering used in `+=`

breaks the code for repeated indices.

The mean is taken with a simple division of the number of repeated values that map to the same index:

```
output /= counts[:, newaxis]
```

Using Ashwini Chaudhary’s `generate_test_data(2000)`

(giving a 10000×4 array), my rough timings are:

```
name time/ms Author
avg_dups 230 gwg
avg_dups_fast 33 Ashwini Chaudhary
avg_dups_python 45 Ashwini Chaudhary
avg_dups 430 Veedrac
avg_dups 5 Veedrac with Jaime's improvement
```

Your current approach is slow because it takes O(n2)$\mathcal{O}({n}^{2})$ time, for each unique gene you’re checking for its index in `genes`

.

One not-pure NumPy approach that I can think of will take O(nlogn)$\mathcal{O}(n\mathrm{log}n)$ time for this (explanation in comments).

```
from collections import defaultdict
from itertools import count
import numpy as np
def avg_dups_fast(genes, values):
# Find the sorted indices of all genes so that we can group them together
sorted_indices = np.argsort(genes)
# Now create two arrays using `sorted_indices` where similar genes and
# the corresponding values are now grouped together
sorted_genes = genes[sorted_indices]
sorted_values = values[sorted_indices]
# Now to find each individual group we need to find the index where the
# gene value changes. We can use `numpy.where` with `numpy.diff` for this.
# But as numpy.diff won't work with string, so we need to generate
# some unique integers for genes, for that we can use
# collections.defaultdict with itertools.count.
# This dict will generate a new integer as soon as a
# new string is encountered and will save it as well so that same
# value is used for repeated strings.
d = defaultdict(count(0).next)
unique_ints = np.fromiter((d[x] for x in sorted_genes), dtype=int)
# Now get the indices
split_at = np.where(np.diff(unique_ints)!=0)[0] + 1
# split the `sorted_values` at those indices.
split_items = np.array_split(sorted_values, split_at)
return np.unique(sorted_genes), np.array([np.mean(arr, axis=0) for arr in split_items])
```

Also a Pure Python approach that will take only O(n)$\mathcal{O}(n)$ time. Here I’ve simply used a dictionary with the gene as key and the corresponding values are going to be appended in a list:

```
from collections import defaultdict
from itertools import izip
def avg_dups_python(genes, values):
d = defaultdict(list)
for k, v in izip(genes, values):
d[k].append(v)
return list(d), [np.mean(val, axis=0) for val in d.itervalues()]
```

Timing comparisons:

```
>>> from string import ascii_letters
>>> from itertools import islice, product
>>> def generate_test_data(n):
genes = np.array([''.join(x) for x in islice(product(ascii_letters, repeat=3), n)]*5, dtype='S3')
np.random.shuffle(genes)
vals = np.array([[2.0, 2.0, 9.0, 9.0], # A: will be averaged with row=4
[3.0, 3.0, 3.0, 3.0], # C: will be averaged with row=2
[8.0, 8.0, 2.0, 2.0], # C: will be averaged with row=1
[4.0, 4.0, 4.0, 3.0], # B: is fine
[1.0, 1.0, 1.0, 1.0]]*n) # A: will be averaged with row=0
return genes, vals
...
>>> data = generate_test_data(20000)
>>> %timeit avg_dups(*data)
1 loops, best of 3: 18.4 s per loop
>>> %timeit avg_dups_fast(*data)
10 loops, best of 3: 166 ms per loop
>>> %timeit avg_dups_python(*data)
1 loops, best of 3: 253 ms per loop
>>> (avg_dups(*data)[1] == avg_dups_fast(*data)[1]).all()
True
```