Problem
Here is my code:
def read_Coordinates_Atoms2(fileName, only_CA = True):
'''
in : PDB file
out : matrix with coordinates of atoms
'''
with open(fileName, 'r') as infile:
for line in infile :
if only_CA == True :
if line.startswith('ATOM') and line[13:15] == 'CA':
try: # matrix fill-up
CoordAtoms = np.vstack([CoordAtoms, [float(line[30:38]), float(line[38:46]), float(line[46:54])]]) # np.append
except NameError: # matrix declaration
CoordAtoms = np.array([[line[30:38],line[38:46], line[46:54]]], float)
else :
if line.startswith('ATOM'):
try: # matrix fill-up
CoordAtoms = np.vstack([CoordAtoms, [float(line[30:38]), float(line[38:46]), float(line[46:54])]]) # np.append
except NameError: # matrix declaration
CoordAtoms = np.array([[line[30:38],line[38:46], line[46:54]]], float)
return CoordAtoms
Is there a more efficient way to do this ? I mean, a way where I don’t have to write twice the same lines? I think the code should look more like this :
def foo(file, condition2 = True):
if condition1 and condition2 :
# do lots of instructions
elif condition1 :
# do the same lots of instructions (but different output)
Solution
Seeing that both your blocks are identical, you can be able to merge them using boolean logic.
First thing is that, in each case, you perform line.startswith('ATOM')
so put that first.
Second, either you have only_CA
being True
and you need 'CA'
at line[13:15]
too, or you have only_CA
being False
. In other words, you keep the line if either only_CA
is False
or 'CA'
is at line[13:15]
.
This lets you rewrite your for
loop as:
for line in infile:
if line.startswith('ATOM') and (not only_CA or line[13:15] == 'CA'):
try: # matrix fill-up
CoordAtoms = np.vstack([CoordAtoms, [float(line[30:38]), float(line[38:46]), float(line[46:54])]]) # np.append
except NameError: # matrix declaration
CoordAtoms = np.array([[line[30:38],line[38:46], line[46:54]]], float)
You can also extract out the line parsing at it is somehow repeated:
for line in infile:
if line.startswith('ATOM') and (not only_CA or line[13:15] == 'CA'):
data = [line[30:38], line[38:46], line[46:54]]
try: # matrix fill-up
CoordAtoms = np.vstack([CoordAtoms, [float(x) for x in data]]) # np.append
except NameError: # matrix declaration
CoordAtoms = np.array([data], float)
But you can also simplify the whole thing by converting your data to float before the try
and feeding np.array
data of the correct type:
for line in infile:
if line.startswith('ATOM') and (not only_CA or line[13:15] == 'CA'):
data = [float(line[begin:end]) for begin, end in ((30, 38), (38, 46), (46, 54))]
try: # matrix fill-up
CoordAtoms = np.vstack([CoordAtoms, [data]]) # np.append
except NameError: # matrix declaration
CoordAtoms = np.array([data])
I’ll approach this more from a StackOverflow numpy
efficiency perspective, not the CR styling one.
First simplification:
def read_Coordinates_Atoms2(fileName, only_CA = True):
'''
in : PDB file
out : matrix with coordinates of atoms
'''
# appending to a list is more efficient than array concatenation
coord_list = []
with open(fileName, 'r') as infile:
for line in infile :
# looks like you are parsing each line the same
parsed_line = [float(line[30:38]), float(line[38:46]), float(line[46:54])]
if only_CA == True :
if line.startswith('ATOM') and line[13:15] == 'CA':
coord_list.append(parsed_line)
else :
if line.startswith('ATOM'):
coord_list.append(parsed_line)
CoordAtoms = np.array(coord_list)
return CoordAtoms
Using list append will improve speed more than consolidating the ‘ifs’.
Two further changes come to mind:
We can collect all values as strings, and let np.array
do the conversion to float all at once. That’s a tentative change, and needs to be tested.
We can reword the conditionals. I’m leaving two append
blocks because I think it makes the logic clearer. Reworking the conditions so there is only one append
statement will not improve speed.
def read_Coordinates_Atoms2(fileName, only_CA = True):
# ....
coord_list = []
with open(fileName, 'r') as infile:
for line in infile :
# looks like you are parsing each line the same
parsed_line = [line[30:38], line[38:46], line[46:54]]
if line.startswith('ATOM'):
if only_CA and line[13:15] == 'CA':
coord_list.append(parsed_line)
else :
coord_list.append(parsed_line)
CoordAtoms = np.array(coord_list, dtype=float)
return CoordAtoms
np.genfromtxt
allows you to specify field widths as the delimiter. So an alternative design is to read the whole file as an appropriate structured array, and then filter out the desired elements.
A function like this should do it. I haven’t tested it, so I’m sure there are some bugs. I think speed will be similar, unless you are skipping a large number of lines. Both approaches have to read all lines, and that is the main time consumer.
def read_Coordinates_Atoms2(fileName, only_CA = True):
# ...
# complicated dtype because you are skipping some columns
# and it groups the 3 float fields into one array
dt = [('ATOM','S4'),('skip1',str),('CA','S2'),('skip2',str),('coor',float,(3,))]
del = [4,9,2,15,8,8,8]
data = np.genfromtxt(fileName, dtype=dt, delimiter=del)
idx = data['ATOM']
data = data[idx]
if only_CA:
idx = data['CA']=='CA'
data = data[idx]
return data['coor']
pandas
also has a fast and powerful csv
reader.
I could test these changes if you give a few sample lines.
Ok, first, some styling advises:
- Function / variable names should be named using
snake_case
convention. For example,read_Coordinates_Atoms2
will becomeread_coordinates_atoms2
if only_CA == True
can beif only_CA
- After
else
statement you shouldn’t have a space - After
,
you should always have a space with open(fileName, 'r') as infile
can bewith open(fileName) as infile
.open()
opens by default a file in read mode.- Usually, you shouldn’t have any spaces around
=
when declaring a function argument:only_CA = True
should beonly_CA=True
- Use
is
when comparing strings:line[13:15] == 'CA'
should be'CA' in line[13:15]
So far, with all the above, you’d have this:
def read_coordinates_atoms2(file_name, only_ca=True):
'''
in : PDB file
out : matrix with coordinates of atoms
'''
with open(file_name) as infile:
for line in infile:
if only_ca:
if line.startswith('ATOM') and 'CA' in line[13:15]:
try:
coord_atoms = np.vstack(
[coord_atoms, [float(line[30:38]), float(line[38:46]), float(line[46:54])]])
except NameError:
coord_atoms = np.array([[line[30:38], line[38:46], line[46:54]]], float)
else:
if line.startswith('ATOM'):
try: # matrix fill-up
coord_atoms = np.vstack(
[coord_atoms, [float(line[30:38]), float(line[38:46]), float(line[46:54])]])
except NameError:
coord_atoms = np.array([[line[30:38], line[38:46], line[46:54]]], float)
return coord_atoms
Follow DRY principle (DON’T REPEAT YOURSELF)
You’re doing this twice:
try:
coord_atoms = np.vstack([coord_atoms, [float(line[30:38]), float(line[38:46]), float(line[46:54])]])
except NameError:
coord_atoms = np.array([[line[30:38], line[38:46], line[46:54]]], float)
So let’s wrap it into a function:
def fill_or_declare_matrix(line, coord_atoms):
try:
return np.vstack([coord_atoms, [float(line[30:38]), float(line[38:46]), float(line[46:54])]])
except NameError:
return np.array([[line[30:38], line[38:46], line[46:54]]], float)
Now, we can get rid of those nested if/else
conditions by doing this:
if only_ca and line.startswith('ATOM') and 'CA' in line[13:15]:
...
elif only_ca and line.startswith('ATOM'):
...
So far we have this code:
def fill_or_declare_matrix(line, coord_atoms):
try:
return np.vstack([coord_atoms, [float(line[30:38]), float(line[38:46]), float(line[46:54])]])
except NameError:
return np.array([[line[30:38], line[38:46], line[46:54]]], float)
def read_coordinates_atoms2(file_name, only_ca=True):
# Input: PDB File;
# this function returns a matrix with coordinates of atoms
with open(file_name) as infile:
for line in infile:
if only_ca and line.startswith('ATOM') and 'CA' in line[13:15]:
coord_atoms = fill_or_declare_matrix(line, coord_atoms)
elif only_ca and line.startswith('ATOM'):
coord_atoms = fill_or_declare_matrix(line, coord_atoms)
return coord_atoms