I’ve written a script in Python following OOP guidelines. It has two classes, both of which can make requests and produce results: it parses names, phone numbers, and email addresses from Yellow Pages. I’ve already run my script and it works as expected; it can now parse those fields from any link on that site, no matter how many pages it has to traverse. So far I’ve used one class for making requests and the other for processing results. This time, however, I intend to build a bridge between the two classes so that both of them can make requests and do the same work when needed. I’ve tried my best to write a good crawler class. Any suggestions for making it more dynamic or otherwise improving it would be highly appreciated.
```python
from lxml import html
import requests


class web_parser:

    page_link = "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=San%20Francisco%2C%20CA"
    main_url = "https://www.yellowpages.com"

    def __init__(self):
        self.direct_links = [self.page_link]
        self.vault = []

    def parser(self):
        for link in self.direct_links:
            self.get_link(link)
            self.processing_main_link(link)

    def processing_main_link(self, sub_links):
        page = requests.get(sub_links)
        tree = html.fromstring(page.text)
        page_links = tree.cssselect("div.pagination a")
        for page_link in page_links:
            if self.main_url + page_link.attrib["href"] not in self.direct_links:
                self.direct_links.append(self.main_url + page_link.attrib["href"])

    def get_link(self, url):
        page = requests.get(url)
        tree = html.fromstring(page.text)
        item_links = tree.cssselect('h2.n a:not([itemprop="name"]).business-name')
        for item_link in item_links:
            if self.main_url + item_link.attrib["href"] not in self.vault:
                self.vault.append(self.main_url + item_link.attrib["href"])


class get_docs(web_parser):

    def __init__(self):
        web_parser.__init__(self)

    def procuring_links(self):
        for link in self.vault:
            self.using_links(link)

    def using_links(self, newly_created_link):
        page = requests.get(newly_created_link)
        tree = html.fromstring(page.text)
        name = tree.cssselect('div.sales-info h1')[0]
        phone = tree.cssselect('p.phone')[0]
        try:
            email = tree.cssselect('div.business-card-footer a.email-business')[0].attrib["href"].replace("mailto:", "")
        except IndexError:
            email = ""
        print(name.text, phone.text, email)


if __name__ == '__main__':
    crawl = get_docs()
    crawl.parser()
    crawl.procuring_links()
```
- Naming conventions: class names should normally use PascalCase (e.g. `WebParser` rather than `web_parser`).
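The renaming could look like this (a sketch with hypothetical names; `WebParser` and `DocumentFetcher` are my suggestions, not names from your code):

```python
class WebParser:
    """Collects listing and pagination links (renamed from web_parser)."""

    def __init__(self):
        self.direct_links = []
        self.vault = []


class DocumentFetcher(WebParser):
    """Fetches details for each stored link (renamed from get_docs)."""
```

Methods keep snake_case (`parser`, `get_link`, …); only the class names change.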
- Requests best practices: if you want to keep using the `requests` library, see my earlier answer on `requests` best practices.
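A minimal sketch of the usual `requests` practices: share one `Session` (it reuses connections and default headers), set a timeout, and turn HTTP errors into exceptions. The `User-Agent` string here is an arbitrary placeholder:

```python
import requests

# One shared Session reuses TCP connections across all requests the
# crawler makes and carries common headers.
session = requests.Session()
session.headers.update({"User-Agent": "my-crawler/0.1"})  # placeholder UA


def fetch(url):
    # A timeout stops the crawler from hanging forever on one page;
    # raise_for_status() raises on HTTP errors (404, 500, ...).
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```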
- Speeding up your script:
- Threading: you can use the `threading` standard library module to fetch data from URLs and process that data concurrently.
- Concurrent futures: this approach has existed since Python 3.2, and there is a good example on the docs page. You may run into problems combining `requests` with the
`concurrent.futures` library, so take a look at the standard library documentation first.
- Scrapy: a modern Python framework for scraping data from websites. It provides lots of built-in functionality, from crawling pages to storing results however you wish.
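To illustrate the `concurrent.futures` option: the sketch below fetches several URLs in parallel with a thread pool. The `fetch` function is a stand-in for a real `requests.get` call so the example runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url):
    # Stand-in for requests.get(url).text; returns the URL itself so the
    # example is runnable offline. Swap in a real HTTP call in practice.
    return url


urls = ["https://www.yellowpages.com/a", "https://www.yellowpages.com/b"]

results = []
with ThreadPoolExecutor(max_workers=5) as executor:
    # Submit all downloads at once; collect them as each one finishes.
    futures = {executor.submit(fetch, url): url for url in urls}
    for future in as_completed(futures):
        results.append(future.result())
```

`as_completed` yields futures in completion order, so slow pages don't block the processing of fast ones.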
- Logging: it is good practice to log each action your script performs, e.g. every page it scrapes.
- Exceptions: adding exception handlers is good practice. At the moment, if your script fails on any line, all execution stops.
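For example, a hypothetical wrapper around the HTTP call lets the crawl skip a failing page instead of dying on it (the function name and behaviour are my suggestion, not part of your script):

```python
import requests


def fetch(url):
    # Catch the requests base exception so one bad URL, timeout, or
    # HTTP error no longer kills the whole crawl.
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as error:
        print(f"skipping {url}: {error}")
        return None
    return response.text
```

Callers then only need to check for `None` before parsing the page.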