Problem
There many other better ways to parse BLAST output in .xml format, but I was curious to try using regex, even if it is not so straightforward and common. Here is the code how to extract translated query sequences from BLASTX output in fasta format.
################
fasta_out = True
#################
import re
import sys
base = sys.argv[1]
base = base.rstrip('xml')
if fasta_out == True:
seq_out = open(base+'fasta', 'w')
read_def = set()
with open(sys.argv[1],'rb') as xml:
for line in xml:
if re.search('<Iteration_query-def>', line) != None:
line = line.strip()
line = line.rstrip()
line = re.sub('<Iteration_query-def>', '', line)
line = re.sub('</Iteration_query-def>', '', line)
query_def = line
if re.search('<Hit_def>', line) != None:
line = line.strip()
line = line.rstrip()
line = re.sub('<Hit_def>', '', line)
line = re.sub('</Hit_def>', '', line)
hit_def = line[:line.index(' ')]
if re.search('<Hsp_query-frame>', line) != None:
line = line.strip()
line = line.rstrip()
line = re.sub('<Hsp_query-frame>', '', line)
line = re.sub('</<Hsp_query-frame>', '', line)
frame = line
if re.search('<Hsp_qseq>', line) != None:
if query_def not in read_def:
read_def.add(query_def)
line = line.strip()
line = line.rstrip()
line = re.sub('<Hsp_qseq>', '', line)
line = re.sub('</Hsp_qseq>', '', line)
seq = line
if fasta_out == True:
print >> seq_out, '>'+query_def+' '+hit_def+'n'+seq
if fasta_out == True:
seq_out.close()
Input sample
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastx</BlastOutput_program>
<BlastOutput_version>blastx 2.2.21 [Jun-14-2009]</BlastOutput_version>
<BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~"Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
<BlastOutput_db>Triticum_aestivum.pep.all.fa</BlastOutput_db>
<BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
<BlastOutput_query-def>contig1</BlastOutput_query-def>
<BlastOutput_query-len>1506</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_matrix>BLOSUM62</Parameters_matrix>
<Parameters_expect>0.001</Parameters_expect>
<Parameters_gap-open>11</Parameters_gap-open>
<Parameters_gap-extend>1</Parameters_gap-extend>
<Parameters_filter>F</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>lcl|1_0</Iteration_query-ID>
<Iteration_query-def>contig1</Iteration_query-def>
<Iteration_query-len>1506</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gnl|BL_ORD_ID|37668</Hit_id>
<Hit_def>Traes_4BL_9859F0705.1 pep:novel scaffold:IWGSP1:IWGSC_CSS_4BL_scaff_7021186:3087:16040:1 gene:Traes_4BL_9859F0705 transcript:Traes_4BL_9859F0705.1 description:""</Hit_def>
<Hit_accession>37668</Hit_accession>
<Hit_len>1032</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>1021.92</Hsp_bit-score>
<Hsp_score>2641</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>1506</Hsp_query-to>
<Hsp_hit-from>210</Hsp_hit-from>
<Hsp_hit-to>711</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_identity>501</Hsp_identity>
<Hsp_positive>501</Hsp_positive>
<Hsp_align-len>502</Hsp_align-len>
<Hsp_qseq>LAVDSEFNQVLQSDSCRLYQLQSHTCSQGHPLNRFTWGNKKSLSNAMGSGINLREEILQMYMSNYHGGMMKLVIIGGEPLDTLEAWTMELFSEVKAGPLLEISPKTDMPFWKSGKLHKLEAVRDVHSLFLSWTLPCLHKEYMKKPEDYLAHLLGHEGKGSLLYFLKAKGWASTLSAGVGTGGTQRSSYAYIFEMSIRLSDSGLKNLFEVITAVYQYINLLKQSEPQEWIFKELQDIGYMEFRFAEEQPPDDYVVDLAENMLFYSEKHIISGEYIYEGWEPELVKHVLSFFHPDNMRVDILSRSFDKQSQAIRCEPWFGSQYIEEDIPSSLIESWRNPVEIDGNFHLPRKNEYIPGDFSLRNASIPKSSNDDNPRCIVDEPFIKLWHKMDITFNVPRANAYFLISVKDGCSSLRNSVLTDLFANLLKDELNEVLYQAYVAKLETSLSVVGSNLELKLYGYNDKLAILLSHILAASQSFSPKIDRFEVIKEDLERAYRNTNMKP</Hsp_qseq>
<Hsp_hseq>LAVDSEFNQVLQSDSCRLYQLQSHTCSQGHPLNRFTWGNNKSLSNAMGSGINLREEILQMYMSNYHGGMMKLVIIGGEPLDTLEAWTMELFSEVKAGPLLEISPKTDMPFWKSGKLHKLEAVRDVHSLFLSWTLPCLHKEYMKKPEDYLAHLLGHEGKGSLLYFLKAKGWASTLSAGVGTGGTQRSSYAYIFEMSIRLSDSGLKNLFEVITAVYQYINLLKQSEPQEWIFKELQDIGYMEFRFAEEQPPDDYVVDLAENMLFYSEKHIISGEYIYEGWEPELVKHVLSFFHPDNMRVDILSRSFDKQSQAIRCEPWFGSQYIEEDIPSSLIESWRNPVEIDGNFHLPRKNEYIPGDFSLRNASIPKSSNDDNPRCIVDEPFIKLWHKMDITFNVPRANAYFLISVKDGCSSLRNSVLTDLFANLLKDELNEVLYQAYVAKLETSLSVVGSNLELKLYGYNDKLAILLSHILAASQSFSPKIDRFEVIKEDLERAYRNTNMKP</Hsp_hseq>
<Hsp_midline>LAVDSEFNQVLQSDSCRLYQLQSHTCSQGHPLNRFTWGN KSLSNAMGSGINLREEILQMYMSNYHGGMMKLVIIGGEPLDTLEAWTMELFSEVKAGPLLEISPKTDMPFWKSGKLHKLEAVRDVHSLFLSWTLPCLHKEYMKKPEDYLAHLLGHEGKGSLLYFLKAKGWASTLSAGVGTGGTQRSSYAYIFEMSIRLSDSGLKNLFEVITAVYQYINLLKQSEPQEWIFKELQDIGYMEFRFAEEQPPDDYVVDLAENMLFYSEKHIISGEYIYEGWEPELVKHVLSFFHPDNMRVDILSRSFDKQSQAIRCEPWFGSQYIEEDIPSSLIESWRNPVEIDGNFHLPRKNEYIPGDFSLRNASIPKSSNDDNPRCIVDEPFIKLWHKMDITFNVPRANAYFLISVKDGCSSLRNSVLTDLFANLLKDELNEVLYQAYVAKLETSLSVVGSNLELKLYGYNDKLAILLSHILAASQSFSPKIDRFEVIKEDLERAYRNTNMKP</Hsp_midline>
Solution
Redundancy
You have a lot of redundant code that you repeat several times.
-
You can move these lines:
line = line.strip() line = line.rstrip()
to immediately after
for line in xml:
. This saves you from having tostrip
in each if-statement. -
As pointed out above, you use
strip
andrstrip
. This is also redundant:strip
removes both leading and trailing characters from the string, whilerstrip
removes only trailing characters. So unless you are removing extra trailing characters (which it seems you aren’t)rstrip
is not needed. -
Your
re.sub
statements can be compacted into one statement using the pipe operator (a|b
) which searches for botha
andb
:line = re.sub('<Hit_def>|</Hit_def>', '', line)
-
What is
frame
doing? In your code snippet, it serves no purpose. -
Your
re.search
functions are not needed. If you dore.sub
on a string and the pattern you gave it does not find a match, it will simply return the original string:>>>foo = 'Hello World!' >>>foo = re.sub('apples', 'oranges', foo) >>>foo 'Hello World!'
Another problem with this is, in its current state, your code could throw a
NameError
because one of the variablesquery_def, hit_def, frame, seq
may not have been defined:if re.search('<Hsp_qseq>', line) != None: # I will error out because if the line didn't contain `<Iteration_query-def>` # `query_def` was never created! if query_def not in read_def:
Order
import
s always go at the top.
If-Statements
While your statements are technically correct, they can be made more pythonic by excluding the == True
and != None
:
# This is the same as...
foo = True
if foo:
# ...this
if foo == True:
Also, your statements may be able to be changed into elif
s depending on whether a single line can contain multiple of the tags you are looking for. If not, then elif
s will help you performace quite a bit.
Another note, if
String Formatting
The most pythonic and preferred way to create strings is to use str.format
:
foo = '{} {}!'.format('Hello', 'World')
Using this method in your application will remove all of the messy string concatenation:
print >> seq_out, '>{} {}n{}'.format(query_def, hit_def, seq)
Files
I am going to assume that you need to run the parsing no matter what and only print to seq_out
if fasta_out
is True
. With that being said, instead of writing to the file each time, which requires you to use the insecure open()
and close()
functions, we will simply create a single string which you will write once at the very end:
seq_body_text = ''
with open(sys.argv[1],'rb') as xml:
for line in xml:
# Yadda yadda....do regex stuff
...
if fasta_out:
new_line = '>{} {}n{}'.format(query_def, hit_def, seq)
seq_body_text = '{}{}n'.format(seq_body_text, new_line)
# Once the whole file is parsed, write the body text.
if fasta_out:
with open(seq_out, 'w') as file:
file.write(seq_body_text)
Here is my version of your code.
NOTE: I assume each line can have each of the tags. However, I could have assumed wrong because your logical structure was slightly muddy.
import re
import sys
base = sys.argv[1]
base = base.rstrip('xml')
fasta_out = True
read_def = set()
with open(sys.argv[1],'rb') as xml:
for line in xml:
line = line.strip()
line = re.sub('<Iteration_query-def>|</Iteration_query-def>', '', line)
query_def = line
if query_def not in read_def:
read_def.add(query_def)
if fasta_out:
# Why compute these lines in the main loop? Move them here
# and save computation.
line = re.sub('<Hit_def>|</Hit_def>', '', line)
hit_def = line[:line.index(' ')]
line = re.sub('<Hsp_query-frame>|</Hsp_query-frame>', '', line)
line = re.sub('<Hsp_qseq>|</Hsp_qseq>', '', line)
seq = line
new_line = '>{} {}n{}'.format(query_def, hit_def, seq)
seq_body_text = '{}{}n'.format(seq_body_text, new_line)
# Once the whole file is parsed, write the body text.
if fasta_out:
with open('{}{}'.format(base, 'fasta'), 'w') as file:
file.write(seq_body_text)
If you really want to parse xml file with python regex, you should match against whole file, not per lines. For example:
matched = re.search("<Hit_def>(.*?)</Hit_def>", wholexml, re.S)
if matched: print matched.group(1)
Here you have tutorial for advanced regex: http://www.cofoh.com/advanced-regex-tutorial-python
But regex for xml is a bad idea. It should be done by xml library. In the above example you have to even deal with special XML characters (& quot;, …)
Happy bioinformatics !