Problem
I have a file with just 3500 lines like these:
filecontent= "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234"
Then I want to grab every line from filecontent that matches a certain string (with Python 2.7):
this_item= "IBMlalala123"
matchingitems = re.findall(".*?;.*?;.*?;.*?;.*?;.*?;.*?"+this_item,filecontent)
It takes 17 seconds for each findall. I need to search 4000 times in these 3500 lines. It takes forever. Any idea how to speed it up?
Solution
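The pattern .*?;.*? will cause catastrophic backtracking: the dot also matches a semicolon, so the engine can divide the text between the fields in many different ways and tries them all before giving up on a line.
To resolve the performance issues, replace each .*?; with [^;]*; which should be much faster.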
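As a rough sketch of the replacement suggested above, the pattern from the question could be rewritten as follows. The second sample line, the leading ^ anchor with re.MULTILINE, the \n inside the character class, and the re.escape call are assumptions added for illustration and are not part of the original question.

```python
import re

# Sample data in the question's format; the second line is made up for illustration.
filecontent = (
    "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234\n"
    "13P398;Otherstuff;t;HP;HP lalala 9;5 things;;HPlalala9\n"
)
this_item = "IBMlalala123"

# Six ';'-terminated fields, then anything up to the search string.
# [^;\n]* cannot run past a semicolon or a newline, so each field can only
# match in one way and the engine has far fewer alternatives to retry.
# Anchoring with ^ (assumption) stops the search restarting mid-line.
pattern = re.compile(r"^(?:[^;\n]*;){6}.*?" + re.escape(this_item), re.MULTILINE)

matchingitems = pattern.findall(filecontent)
print(matchingitems)
```

Compiling the pattern once per search string and reusing it across all lines avoids rebuilding it for every call.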
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. — Jamie Zawinski
A few comments:
- Regular expressions might not be the right tool for this.
- .*?;.*?;.*?;.*?;.*?;.*?;.*? is potentially very slow and might not do what you want it to do (it could match many more ; than what you want). [^;]*; would most probably do what you want.
Use split, like so:
>>> filecontent = "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234"
>>> items = filecontent.split(";")
>>> items
['13P397', 'Fotostuff', 't', 'IBM', 'IBM lalala 123|IBM lalala 1234', '28.000 things', '', 'IBMlalala123|IBMlalala1234']
>>>
I’m a bit unsure as to what you wanted to do in the last step, but perhaps something like this?
>>> [(i, e) for i,e in enumerate(items) if 'IBMlalala123' in e]
[(7, 'IBMlalala123|IBMlalala1234')]
>>>
UPDATE:
If I get your requirements right on the second attempt: to find all lines in the file that have ‘IBMlalala123’ as one of the semicolon-separated fields, do the following:
>>> with open('big.file', 'r') as f:
...     # Strip the line ending so the last field compares cleanly.
...     matching_lines = [line for line in f if 'IBMlalala123' in line.rstrip('\r\n').split(';')]
...
>>>
Some thoughts:
Do you need a regex? You want a line that contains the string so why not use ‘in’?
If you are using the regex to validate the line format, you can do that after the less expensive ‘in’ finds a candidate line, which reduces the number of times the regex is used.
If you do need a regex, then what about replacing ‘.*?;’ with ‘[^;]*;’?
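The two-step idea above (a cheap ‘in’ test first, the regex only on candidates) might look roughly like this. The sample lines and the line_format pattern are hypothetical stand-ins; the real format check depends on what the file actually has to look like.

```python
import re

# Sample lines in the question's format; the second one is made up.
lines = [
    "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234",
    "13P398;Otherstuff;t;HP;HP lalala 9;5 things;;HPlalala9",
]
this_item = "IBMlalala123"

# Hypothetical format check: at least six ';'-terminated fields before the item.
line_format = re.compile(r"(?:[^;]*;){6}.*?" + re.escape(this_item))

# Step 1: plain substring test, which discards most lines cheaply.
candidates = [line for line in lines if this_item in line]

# Step 2: run the regex only on the few lines that survived step 1.
matching_lines = [line for line in candidates if line_format.match(line)]
print(matching_lines)
```

Since most lines will not contain the search string at all, the regex ends up running on only a small fraction of the 3500 lines per search.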