Scraping data by unveiling a button on craigslist

Problem

I’ve written some code to parse names and phone numbers from craigslist. It starts from the link in “m_url”, goes one layer deep to parse the name, and then goes another layer deep to parse the phone number. Note that it only goes two layers deep when it finds a “show contact” button on the listing page, so that it can follow that link to unveil and scrape the phone number. It only prints a result when it sees that button: there are around 120 listings on the search page, but only those containing that specific button are printed. Links hidden behind a “show contact” button used to intimidate me whenever I came across one on a page I was supposed to harvest data from, which is why I worked on this. It runs smoothly now. Any suggestions for improving this script would be very helpful.

import re
import requests
from lxml import html

m_url = "http://bangalore.craigslist.co.in/search/reb?s=120"
base = "http://bangalore.craigslist.co.in"

def get_link(url):
    page1 = requests.get(url).text
    tree = html.fromstring(page1)
    for row in tree.xpath('//li[@class="result-row"]'):
        links = base + row.xpath(".//a[contains(concat(' ', @class, ' '), ' hdrlnk ')]/@href")[0]
        process_doc(links)

def process_doc(medium_link):

    page2 = requests.get(medium_link).text
    tree = html.fromstring(page2)
    try:
        name = tree.xpath('//span[@id="titletextonly"]/text()')[0]
    except IndexError:
        name = ""
    try:
        link = base + tree.xpath('//section[@id="postingbody"]//a[@class="showcontact"]/@href')[0]
    except IndexError:
        link = ""

    parse_doc(name, link)

def parse_doc(title, target_link):

    if target_link:
        page = requests.get(target_link).text
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(title, tel)

get_link(m_url)

Picture of that button:

[screenshot of the “show contact” button omitted]

Solution

There are some things I would do differently:

  • create a class – this way you can share a web-scraping session between the methods, as well as the base and start URLs
  • use more meaningful variable and method names – for instance, parse_doc() could become get_contact_info(); you can also simply re-use the page variable name in different methods instead of page1, page2, etc.
  • you can use the findtext() method to get the title text
  • I would also return the results from the scraper (here by yielding them) and print them out outside of it
  • you are searching for the phone number twice – instead, use the .search() method and check whether you got a match object or None
  • you can pre-compile and re-use the regular expression pattern for the phone number (see the short sketch after this list)
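
To make the last two points concrete, here is a minimal sketch (the page string and its contents are made up purely for illustration) of the difference between scanning the page twice with findall() and scanning it once with a pre-compiled pattern and .search():

import re

# compile once and re-use; \d{10} matches a 10-digit phone number
PHONE_NUMBER_PATTERN = re.compile(r'\d{10}')

page = "Call the owner at 9876543210 for details."  # dummy page text

# original approach: findall() scans the whole page twice
tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""

# suggested approach: scan once with .search() and check the match object
match = PHONE_NUMBER_PATTERN.search(page)
tel = match.group(0) if match else ""

print(tel)  # 9876543210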

Improved code:

import re

import requests
from lxml import html


class CraigListScraper:
    PHONE_NUMBER_PATTERN = re.compile(r'\d{10}')  # a 10-digit phone number

    def __init__(self, start_url, base_url):
        self.session = requests.Session()
        self.base_url = base_url
        self.start_url = start_url

    def scrape(self):
        # fetch the search results page and yield info for every result row
        page = self.session.get(self.start_url).text

        tree = html.fromstring(page)
        for row in tree.xpath('.//li[@class="result-row"]'):
            link = self.base_url + row.xpath(".//a[contains(concat(' ', @class, ' '), ' hdrlnk ')]/@href")[0]

            yield self.process_search_result(link)

    def process_search_result(self, medium_link):
        # listing page: grab the title and, if present, the "show contact" link
        page = self.session.get(medium_link).text
        tree = html.fromstring(page)

        name = tree.findtext('.//span[@id="titletextonly"]')

        try:
            contact_info_link = self.base_url + tree.xpath('//section[@id="postingbody"]//a[@class="showcontact"]/@href')[0]
            phone_number = self.get_contact_info(contact_info_link)
        except IndexError:
            # no "show contact" button on this listing
            phone_number = ""

        return name, phone_number

    def get_contact_info(self, target_link):
        # "show contact" page: extract the revealed 10-digit phone number
        page = self.session.get(target_link).text

        match = self.PHONE_NUMBER_PATTERN.search(page)
        return match.group(0) if match else ""



if __name__ == '__main__':
    start_url = "http://bangalore.craigslist.co.in/search/reb?s=120"
    base_url = "http://bangalore.craigslist.co.in"

    scraper = CraigListScraper(start_url, base_url)
    for result in scraper.scrape():
        print(result)
