Problem
I’m attempting to find companies who mention a particular service in on their homepage. To do this, I am iterating through a csv file with two columns – ID and URL. I’m using BeautifulSoup to get the html and regex to find the string.
At present, my code works, but it feels very clunky and takes forever. I’m also not writing my matching IDs to the new csv, which I haven’t been able to figure out.
Since this is at least working, hopefully, it will help someone else who is spinning their wheels trying to figure it out.
How can it be improved?
import requests
from bs4 import BeautifulSoup
import re
import csv
with open('web1.csv', mode='r') as infile:
reader = csv.reader(infile)
with open('websites_new.csv', mode='w') as outfile:
writer = csv.writer(outfile)
mydict = dict((rows[0],rows[1]) for rows in reader)
newlist = []
for v in mydict.itervalues():
try:
page = requests.get('http://www.' + v)
except:
pass
soup = BeautifulSoup(page.content, 'html.parser')
soupString = str(soup)
re1='.*?'
re2='(secretword)'
rg = re.compile(re1+re2,re.IGNORECASE|re.DOTALL)
m = rg.search(soupString)
if m is None:
value = 'x'
newlist.extend(value)
else:
newlist.extend(v)
print newlist
Solution
First of all, since you are applying a regular expression pattern to the complete source of the page, you don’t need an HTML parser like BeautifulSoup
– directly search inside the page.content
.
And, if you need to go the HTML parsing route and speed matters, choose either lxml
, or lxml
parser with BeautifulSoup
.
You may also look into reusing the same requests.Session()
instance – it may have a positive impact on performance.
Overall though, your approach is blocking/synchronous – your code processes URLs one by one – it would not process the next URL until it is done with the current one. Look into tools like Scrapy
to approach the problem in the asynchronous/non-blocking fashion.
Overall, I think your code is simple and nice enough. I agree with the points raised in alecxe’s answer too though.
One thing that I noticed when skimming your code for the first time is the use of re1
and re2
on lines 21 and 22, respectively. Normally, a good rule of thumb is if you’re numbering your variables, you might want to put them into a list.
However, as you only seem to have two regular expressions, I can understand if that might feel a little redundant. Regardless, I think you should at least make those variable names meaningful by putting their intended function in their names (e.g. instead of re2
perhaps reSecretWord
). Obviously this depends on your style guide/preferences.