I have gene expression data that I represent as a list of genes and a list of lists of values. I average the expression data for any genes with the same name.
genes = ['A', 'C', 'C', 'B', 'A'] vals = [[2.0, 2.0, 9.0, 9.0], # A: will be averaged with row=4 [3.0, 3.0, 3.0, 3.0], # C: will be averaged with row=2 [8.0, 8.0, 2.0, 2.0], # C: will be averaged with row=1 [4.0, 4.0, 4.0, 3.0], # B: is fine [1.0, 1.0, 1.0, 1.0]] # A: will be averaged with row=0
is converted to
genes = ['A', 'B', 'C'] vals = [[1.5, 1.5, 5.0, 5.0], [4.0, 4.0, 4.0, 3.0], [5.5, 5.5, 2.5, 2.5]]
Here is my function:
def avg_dups(genes, values): """Finds duplicate genes and averages their expression data. """ unq_genes = np.unique(genes) out_values = np.zeros((unq_genes.shape, values.shape)) for i, gene in enumerate(unq_genes): dups = values[genes==gene] out_values[i] = np.mean(dups, axis=0) return (unq_genes, out_values)
This function is slower than any other part of my data pipeline, taking 5-10 seconds when other steps that also operate on the whole dataset take sub-seconds. Any thoughts on how I can improve this?
This seems to be the fastest so far:
import numpy from numpy import newaxis def avg_dups(genes, values): folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True) output = numpy.zeros((folded.shape, values.shape)) numpy.add.at(output, indices, values) output /= counts[:, newaxis] return folded, output
This finds the unique genes to fold the values into, along with the
current index → new index mapping and the number of repeated values that map to the same index:
folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)
It adds the row from each current index to the new index in the new
output = numpy.zeros((folded.shape, values.shape)) numpy.add.at(output, indices, values)
numpy.add.at(output, indices, values) is used over
output[indices] += values because the buffering used in
+= breaks the code for repeated indices.
The mean is taken with a simple division of the number of repeated values that map to the same index:
output /= counts[:, newaxis]
Using Ashwini Chaudhary’s
generate_test_data(2000) (giving a 10000×4 array), my rough timings are:
name time/ms Author avg_dups 230 gwg avg_dups_fast 33 Ashwini Chaudhary avg_dups_python 45 Ashwini Chaudhary avg_dups 430 Veedrac avg_dups 5 Veedrac with Jaime's improvement
Your current approach is slow because it takes O(n2) time, for each unique gene you’re checking for its index in
One not-pure NumPy approach that I can think of will take O(nlogn) time for this (explanation in comments).
from collections import defaultdict from itertools import count import numpy as np def avg_dups_fast(genes, values): # Find the sorted indices of all genes so that we can group them together sorted_indices = np.argsort(genes) # Now create two arrays using `sorted_indices` where similar genes and # the corresponding values are now grouped together sorted_genes = genes[sorted_indices] sorted_values = values[sorted_indices] # Now to find each individual group we need to find the index where the # gene value changes. We can use `numpy.where` with `numpy.diff` for this. # But as numpy.diff won't work with string, so we need to generate # some unique integers for genes, for that we can use # collections.defaultdict with itertools.count. # This dict will generate a new integer as soon as a # new string is encountered and will save it as well so that same # value is used for repeated strings. d = defaultdict(count(0).next) unique_ints = np.fromiter((d[x] for x in sorted_genes), dtype=int) # Now get the indices split_at = np.where(np.diff(unique_ints)!=0) + 1 # split the `sorted_values` at those indices. split_items = np.array_split(sorted_values, split_at) return np.unique(sorted_genes), np.array([np.mean(arr, axis=0) for arr in split_items])
Also a Pure Python approach that will take only O(n) time. Here I’ve simply used a dictionary with the gene as key and the corresponding values are going to be appended in a list:
from collections import defaultdict from itertools import izip def avg_dups_python(genes, values): d = defaultdict(list) for k, v in izip(genes, values): d[k].append(v) return list(d), [np.mean(val, axis=0) for val in d.itervalues()]
from string import ascii_letters from itertools import islice, product def generate_test_data(n): genes = np.array([''.join(x) for x in islice(product(ascii_letters, repeat=3), n)]*5, dtype='S3') np.random.shuffle(genes) vals = np.array([[2.0, 2.0, 9.0, 9.0], # A: will be averaged with row=4 [3.0, 3.0, 3.0, 3.0], # C: will be averaged with row=2 [8.0, 8.0, 2.0, 2.0], # C: will be averaged with row=1 [4.0, 4.0, 4.0, 3.0], # B: is fine [1.0, 1.0, 1.0, 1.0]]*n) # A: will be averaged with row=0 return genes, vals data = generate_test_data(20000) %timeit avg_dups(*data) 1 loops, best of 3: 18.4 s per loop %timeit avg_dups_fast(*data) 10 loops, best of 3: 166 ms per loop %timeit avg_dups_python(*data) 1 loops, best of 3: 253 ms per loop (avg_dups(*data) == avg_dups_fast(*data)).all() True