Problem
I have a BLAST output file and want to calculate query coverage, appending the query lengths as an additional column to the output. Let’s say I have
2 7 15
f=open('file.txt', 'r')
lines=f.readlines()
import re
for line in lines:
new_list=re.split(r't+',line.strip())
q_start=new_list[0]
q_end=new_list[1]
q_len=new_list[3]
q_cov=((float(q_end)-float(q_start))/float(q_len))*100
q_cov=round(q_cov,1)
q_cov=str(q_cov)
new_list.append(q_cov)
r=open('results.txt', 'a')
x='t'.join(new_list)
x=x+'n'
r.writelines(x)
f.close()
r.close()
Solution
One serious bug is that you open results.txt
for each line of input. It’s almost always better to open files in a with
block. Then, you won’t have to worry about closing your filehandles, even if the code exits abnormally. The with
block would have made your results.txt
mistake obvious as well.
Since you want to treat your q_start
, q_end
, and q_len
as numbers, I wouldn’t even bother to assign their string representations to a variable. Just convert them to a float
as soon as possible. Similarly, q_cov
should be a float
; I would just stringify it at the last moment. I would also postpone rounding just for the purposes of formatting the output, preferring to preserve precision in q_cov
itself.
Put your import
statements at the beginning of the program.
import re
with open('file.txt') as input, open('results.txt', 'a') as output:
for line in input.readlines():
fields = re.split(r't+', line.strip())
q_start, q_end, q_len = map(float, (fields[0], fields[1], fields[3]))
q_cov = 100 * (q_end - q_start) / q_len
print >>output, 't'.join(fields + [str(round(q_cov, 1))])