Calculate query coverage from BLAST output

Posted on

Problem

I have a BLAST output file and want to calculate query coverage, appending the query lengths as an additional column to the output. Let’s say I have

2 7 15 
f=open('file.txt', 'r')
lines=f.readlines()
import re 
for line in lines:
    new_list=re.split(r't+',line.strip())
    q_start=new_list[0]
    q_end=new_list[1]
    q_len=new_list[3]
    q_cov=((float(q_end)-float(q_start))/float(q_len))*100
    q_cov=round(q_cov,1)
    q_cov=str(q_cov) 
    new_list.append(q_cov)
    r=open('results.txt', 'a')
    x='t'.join(new_list)
    x=x+'n'
    r.writelines(x)
f.close()
r.close()

Solution

One serious bug is that you open results.txt for each line of input. It’s almost always better to open files in a with block. Then, you won’t have to worry about closing your filehandles, even if the code exits abnormally. The with block would have made your results.txt mistake obvious as well.

Since you want to treat your q_start, q_end, and q_len as numbers, I wouldn’t even bother to assign their string representations to a variable. Just convert them to a float as soon as possible. Similarly, q_cov should be a float; I would just stringify it at the last moment. I would also postpone rounding just for the purposes of formatting the output, preferring to preserve precision in q_cov itself.

Put your import statements at the beginning of the program.

import re

with open('file.txt') as input, open('results.txt', 'a') as output:
    for line in input.readlines():
        fields = re.split(r't+', line.strip())
        q_start, q_end, q_len = map(float, (fields[0], fields[1], fields[3]))
        q_cov = 100 * (q_end - q_start) / q_len
        print >>output, 't'.join(fields + [str(round(q_cov, 1))])

Leave a Reply

Your email address will not be published. Required fields are marked *