Unique nucleotide permutations, Python itertools product

Posted on

Problem

I’m searching for all the unique possible permutations of nucleotide given a defined length. By unique, I mean the reverse complement would not be counted.

ACGT

For example, permutations of length 1 would be:

  1. A:T
  2. C:G
  3. G:C
  4. T:A

But the last two wouldn’t be included because they would already be seen.

Length 2 would be:

  1. AA:TT
  2. AC:GT
  3. AG:CT
  4. AT:AT
  5. CA:TG
  6. CC:GG
  7. CG:CG
  8. CT:AG
  9. GA:TC
  10. GC:GC
  11. GG:CC
  12. GT:AC
  13. TA:TA
  14. TC:GA
  15. TG:CA
  16. TT:AA

So, 8, 11, 12, 14, 15, 16 wouldn’t be included for the same reason, and so on.

def data_count(n, trantab):
    nn = int(n)
    count = 0
    seen = set()

    with open('file.txt','w') as f:
        for number in range(1, nn+1):
            for p in itertools.product('ACGT', repeat=number):
                print(''.join(p)+':'+''.join(p)[::-1].translate(trantab))
                if ''.join(p) in seen:
                    pass
                else:
                    seen.add(''.join(p)[::-1].translate(trantab))
                    count += 1
            #print(x)
            f.write(str(number)+' = '+str(count)+'n')
            count = 0

It does the job pretty well until 10 or 12, but beyond that it takes a lot of time to output results.

n is the length of the search and trantab is:

intab = 'ACGT'
outtab = 'TGCA'
trantab = str.maketrans(intab, outtab)

to get the reverse complement of the nucleotide.

Is there any way to improve this one?

Solution

To get the best performance depends on what you want.
If you want to only write to the file you should use maths.
If you want to generate them all, well that’s almost the best you’ll get.

And so if you want to do the latter I’d remove all of the former from the function.
And just create:

def data_count(data, dethp):
    translation = str.maketrans(data, data[::-1])
    for number in range(1, int(depth)+1):
        for product in itertools.product(data, repeat=number):
            product = ''.join(product)
            yield product, product[::-1].translate(translation)

The other bit is a little harder.
What you want to do is find a common factor.
For repeat=2, we get the bullet points in your question.
And so we should we should group them, there’s:

  1. Groups that add to seen.
  2. Groups that aren’t counted due to seen.
  3. Groups that are the same both sides.
# group 1
('AA', 'TT')
('AC', 'GT')
('AG', 'CT')
('CA', 'TG')
('CC', 'GG')
('GA', 'TC')

# group 2
('TT', 'AA')
('GT', 'AC')
('CT', 'AG')
('TG', 'CA')
('GG', 'CC')
('TC', 'GA')

# group 3
('TA', 'TA')
('GC', 'GC')
('CG', 'CG')
('AT', 'AT')

We can then use the above function to find out more about these groups.
And so we can get the following table:

1 + 2 + 3   | 1 + 3     | 3
------------+-----------+-----
4           | 2         | 0
16          | 10        | 4
64          | 32        | 0
256         | 136       | 16
1024        | 512       | 0
4096        | 2080      | 64
16384       | 8192      | 0
65536       | 32896     | 256
262144      | 131072    | 0
1048576     | 524800    | 1024
4194304     | 2097152   | 0
16777216    | 8390656   | 4096

From this we can see that 1 + 3 is the same as (1 + 2 + 3 + 3) // 2.
We can also find a pattern for 1 + 2 + 3 and for 3.
The first number is 4num4num. The second is almost 2num2num.
And so we can use this to find 1 + 3:

def data_count(depth):
    for n in range(1, int(depth)+1):
        yield n, (4 ** n + (0 if n % 2 else 2 ** n)) // 2

But if you want both, then you should use both functions.

Leave a Reply

Your email address will not be published. Required fields are marked *