I’m attempting to find companies that mention a particular service on their homepage. To do this, I am iterating through a csv file with two columns – ID and URL. I’m using BeautifulSoup to get the html and regex to find the string.
At present, my code works, but it feels very clunky and takes forever. I’m also not writing my matching IDs to the new csv, which I haven’t been able to figure out.
Since this is at least working, hopefully, it will help someone else who is spinning their wheels trying to figure it out.
How can it be improved?
    import requests
    from bs4 import BeautifulSoup
    import re
    import csv

    with open('web1.csv', mode='r') as infile:
        reader = csv.reader(infile)

        with open('websites_new.csv', mode='w') as outfile:
            writer = csv.writer(outfile)
            mydict = dict((rows[0],rows[1]) for rows in reader)
            newlist = []

            for v in mydict.itervalues():
                try:
                    page = requests.get('http://www.' + v)
                except:
                    pass
                soup = BeautifulSoup(page.content, 'html.parser')
                soupString = str(soup)
                re1='.*?'
                re2='(secretword)'
                rg = re.compile(re1+re2,re.IGNORECASE|re.DOTALL)
                m = rg.search(soupString)
                if m is None:
                    value = 'x'
                    newlist.extend(value)
                else:
                    newlist.extend(v)
            print newlist
First of all, since you are applying a regular expression pattern to the complete source of the page, you don’t need an HTML parser like BeautifulSoup – directly search inside page.content.
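To illustrate, here is a minimal sketch of searching the page source without any HTML parser. It assumes Python 3, and the helper name page_mentions is made up for this example. Since re.search already scans the whole string, the leading '.*?' from the question is not needed, and re.escape keeps a keyword with special characters from being read as a pattern.

```python
import re


def page_mentions(html, keyword):
    """Return True if keyword occurs anywhere in html, ignoring case."""
    # re.escape treats the keyword literally; re.search scans the whole
    # string, so no '.*?' prefix or DOTALL flag is required.
    return re.search(re.escape(keyword), html, re.IGNORECASE) is not None
```

With requests on Python 3 you would likely pass page.text (a str) rather than page.content (bytes), for example page_mentions(page.text, 'secretword').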
And, if you need to go the HTML parsing route and speed matters, either use lxml directly or use the lxml parser with BeautifulSoup.
You may also look into reusing the same requests.Session() instance – it may have a positive impact on performance.
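A rough sketch of what session reuse could look like, assuming Python 3. The function name find_matching_ids and the timeout default are my own choices, not something from the question. The function takes the session as an argument so that one requests.Session() is created once by the caller and shared by every request.

```python
import re


def find_matching_ids(rows, keyword, session, timeout=10):
    """Return the IDs whose homepage mentions keyword.

    rows    -- iterable of (company_id, domain) pairs, as in the CSV
    session -- one shared requests.Session() (or any object with the same
               .get(url, timeout=...) interface), reused for every request
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    matches = []
    for company_id, domain in rows:
        try:
            response = session.get('http://www.' + domain, timeout=timeout)
        except OSError:
            # requests.RequestException subclasses IOError (OSError in
            # Python 3), so unreachable sites are skipped here.
            continue
        if pattern.search(response.text):
            matches.append(company_id)
    return matches
```

You would call it as find_matching_ids(csv.reader(infile), 'secretword', session) inside a "with requests.Session() as session:" block. The returned IDs can then go to the output file with writer.writerow([company_id]) for each one, which also covers the missing CSV output.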
Overall though, your approach is blocking/synchronous – your code processes URLs one by one – it would not process the next URL until it is done with the current one. Look into tools like Scrapy to approach the problem in the asynchronous/non-blocking fashion.
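Scrapy is one way to get there. As a lighter illustration of the same idea (this is not Scrapy, just a thread pool from the standard library), the sketch below overlaps the slow network waits. It assumes Python 3. The names find_matching_ids_concurrently and fetch are hypothetical, and fetch is whatever function you supply to download a page, for instance a small wrapper around session.get.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def find_matching_ids_concurrently(rows, keyword, fetch, max_workers=8):
    """Return IDs whose page mentions keyword, fetching pages in parallel.

    rows  -- list of (company_id, domain) pairs
    fetch -- callable taking a domain and returning the page text;
             it should raise OSError when the site cannot be reached
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)

    def check(row):
        company_id, domain = row
        try:
            text = fetch(domain)
        except OSError:
            return None  # skip unreachable sites
        return company_id if pattern.search(text) else None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order while the requests overlap in time.
        results = list(pool.map(check, rows))
    return [company_id for company_id in results if company_id is not None]
```

Threads suit this job because almost all of the time is spent waiting on the network, so max_workers mostly controls how many sites are contacted at once.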
Overall, I think your code is simple and nice enough. I agree with the points raised in alecxe’s answer too though.
One thing that I noticed when skimming your code for the first time is the use of re1 and re2 on lines 21 and 22, respectively. Normally, a good rule of thumb is if you’re numbering your variables, you might want to put them into a list.
However, as you only seem to have two regular expressions, I can understand if that might feel a little redundant. Regardless, I think you should at least make those variable names meaningful by putting their intended function in their names (e.g. instead of re2, use reSecretWord). Obviously this depends on your style guide/preferences.
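As a small sketch of the naming idea, assuming Python 3 and PEP 8 style names rather than the camelCase above: the names SECRET_WORD_PATTERN, PATTERNS and matched_pattern_names are invented for this example, and a dict keyed by purpose stands in for the list mentioned earlier.

```python
import re

# A descriptive name replaces the numbered re1/re2 variables.
SECRET_WORD_PATTERN = re.compile(r'secretword', re.IGNORECASE)

# If more patterns are added later, keep them together keyed by purpose
# instead of numbering them.
PATTERNS = {
    'secret_word': SECRET_WORD_PATTERN,
}


def matched_pattern_names(text):
    """Return the names of all patterns that occur in text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]
```

Compiling each pattern once at module level also avoids recompiling it on every loop iteration, as the original code does.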