Problem
I’ve written some code to parse names and phone numbers from craigslist. It starts from the link in “m_url”, goes one layer deep to parse the name, and then another layer deep to parse the phone number. It only goes that second layer deep when it sees a “show contact” button on the listing page, because following that button’s link is what unveils the phone number to scrape. For the same reason it only prints a result when the button is present: there are around 120 names on the search page, but only the listings containing that specific button get printed. Pages with that kind of “show contact” link used to intimidate me whenever I had to harvest data from them, which is why I worked on this script; it runs smoothly now. Any improvement on this script would be very helpful.
import re
import requests
from lxml import html

m_url = "http://bangalore.craigslist.co.in/search/reb?s=120"
base = "http://bangalore.craigslist.co.in"

def get_link(url):
    page1 = requests.get(url).text
    tree = html.fromstring(page1)
    for row in tree.xpath('//li[@class="result-row"]'):
        links = base + row.xpath(".//a[contains(concat(' ', @class, ' '), ' hdrlnk ')]/@href")[0]
        process_doc(links)

def process_doc(medium_link):
    page2 = requests.get(medium_link).text
    tree = html.fromstring(page2)
    try:
        name = tree.xpath('//span[@id="titletextonly"]/text()')[0]
    except IndexError:
        name = ""
    try:
        link = base + tree.xpath('//section[@id="postingbody"]//a[@class="showcontact"]/@href')[0]
    except IndexError:
        link = ""
    parse_doc(name, link)

def parse_doc(title, target_link):
    if target_link:
        page = requests.get(target_link).text
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(title, tel)

get_link(m_url)
Pics of that button: [screenshots omitted]
Solution
There are some things I would do differently:
- create a class – this way you can share a web-scraping session between the methods, as well as the base and start URLs
- use more meaningful variable and method names – for instance, parse_doc() could become get_contact_info(); you can also safely re-use the page variable name in the different methods
- use the findtext() method to get an element's text
- return the results from the scraper and print them outside of it
- you are searching for the phone number twice – instead, call the .search() method once and check whether you got a match object or None
- pre-compile and re-use the regular expression pattern for the phone number (see the short sketch after this list)
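To illustrate the findtext(), .search() and pre-compiled-pattern points, here is a minimal self-contained sketch; the HTML snippet and the phone number in it are made up purely for demonstration:

import re
from lxml import html

PHONE_NUMBER_PATTERN = re.compile(r'\d{10}')  # compiled once, re-usable across pages

# made-up posting page, for demonstration only
page = '<html><body><section id="postingbody">call 9876543210 after 6pm</section></body></html>'

# findtext() returns the text of the first matching element, or None if nothing matches
tree = html.fromstring(page)
print(tree.findtext('.//section[@id="postingbody"]'))  # call 9876543210 after 6pm

# search() scans the string once and returns a match object or None
match = PHONE_NUMBER_PATTERN.search(page)
print(match.group(0) if match else "")  # 9876543210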
Improved code:
import re
import requests
from lxml import html


class CraigListScraper:
    PHONE_NUMBER_PATTERN = re.compile(r'\d{10}')

    def __init__(self, start_url, base_url):
        self.session = requests.Session()
        self.base_url = base_url
        self.start_url = start_url

    def scrape(self):
        page = self.session.get(self.start_url).text
        tree = html.fromstring(page)
        for row in tree.xpath('.//li[@class="result-row"]'):
            link = self.base_url + row.xpath(".//a[contains(concat(' ', @class, ' '), ' hdrlnk ')]/@href")[0]
            yield self.process_search_result(link)

    def process_search_result(self, medium_link):
        page = self.session.get(medium_link).text
        tree = html.fromstring(page)

        name = tree.findtext('.//span[@id="titletextonly"]')

        try:
            contact_info_link = self.base_url + tree.xpath('//section[@id="postingbody"]//a[@class="showcontact"]/@href')[0]
            phone_number = self.get_contact_info(contact_info_link)
        except IndexError:
            phone_number = ""

        return name, phone_number

    def get_contact_info(self, target_link):
        page = self.session.get(target_link).text
        match = self.PHONE_NUMBER_PATTERN.search(page)
        return match.group(0) if match else ""


if __name__ == '__main__':
    start_url = "http://bangalore.craigslist.co.in/search/reb?s=120"
    base_url = "http://bangalore.craigslist.co.in"

    scraper = CraigListScraper(start_url, base_url)
    for result in scraper.scrape():
        print(result)
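One optional hardening step beyond the points above: requests applies no timeout by default, so a single unresponsive listing can stall the whole run. A small helper along these lines could replace the bare self.session.get(...).text calls (the fetch name and the 10-second value are arbitrary choices, not part of the code above):

import requests

def fetch(session, url, timeout=10):
    # fail fast on hung connections and on 4xx/5xx responses
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text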