Automation is my hobby. Admittedly, I never thought I would be interested in it when I studied computer science and engineering back in my university days (my father said my personality befits a software engineer, and I blindly followed his advice). However, during my journey to become a competent software developer I invested much time in simple scripts that do repetitive work for me. While most of my earlier automation projects involved games (via tools like Sikuli, which can control your screen), lately I have been turning my attention to scraping websites and integrating APIs for data aggregation. This post contains some of the tips and tricks I have used to make my web crawlers quicker and more robust.
Before We Start – Requests vs Webdrivers
Many who write programs to crawl webpages use Webdrivers such as Selenium to retrieve HTML content. However, this can be overkill if the pages you want to crawl are mostly static. For webpages with little to no dynamic content, a simple cURL-style HTTP library (such as Requests for Python) issuing GET/POST requests will be sufficient for your needs.
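As a minimal sketch of that simpler approach (the URL below is just a placeholder), fetching a static page with Requests takes only a couple of lines:

import requests

response = requests.get('https://example.com/static-page')  # plain GET, no browser needed
response.raise_for_status()                                  # fail fast on HTTP errors
print(response.text[:200])                                   # raw HTML, ready for parsing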
Webdrivers are recommended when it is infeasible to drive dynamic sites and content purely with cURL-style requests. This is especially true for users who aren’t scraping for information but merely wish to automate a series of actions on a website.
The remaining sections contain advice for both cURL-based and Webdriver-based solutions. Parts that are better explained with examples include sample code for reference – though a word of caution: browsers and Webdriver API specs change frequently, so the samples are not guaranteed to work in the future.
Tips and Advice
Python Libraries for Parsing HTML Documents
My go-to libraries for processing HTML documents are a combination of html5lib, lxml, and xml.etree.ElementTree.
In the following code, buildHtmlTree() retrieves the HTML document at a URL and converts it into a tree ready for element traversal:
import time
import xml.etree.ElementTree

import html5lib
import requests
from lxml import html


class RequestTooManyTimesException(Exception):
    pass


class Html5Scraper(object):
    def __init__(self):
        self.current_url = ""
        self.session = requests.Session()

    def buildHtmlTree(self, url, encoding='UTF-8'):
        self.current_url = url
        html_elements = self.parseHtmlContent(self.getResponseText(url, encoding))
        return html_elements

    def parseHtmlContent(self, content_string):
        # html5lib supports tags that are not closed, so parse twice to
        # convert from html5lib's xml.etree.ElementTree to lxml's
        # lxml.html.HtmlElement.
        html5_etree = html5lib.parse(content_string)
        html5_string = xml.etree.ElementTree.tostring(html5_etree, encoding='UTF-8')
        return html.fromstring(html5_string, parser=html.HTMLParser(encoding='UTF-8'))

    def getResponseText(self, url, encoding='UTF-8'):
        response = None
        retries = 0
        # Note: a response with a 4xx/5xx status is falsy in Requests, so the
        # loop also retries on error statuses; only SSL errors count toward the limit.
        while not response:
            try:
                response = self.session.get(url)
            except requests.exceptions.SSLError:
                retries += 1
                if retries > 10:
                    raise RequestTooManyTimesException(
                        "Max connection exceeded when accessing: %s" % url)
                time.sleep(60)
        response.encoding = encoding
        return response.text
Once you have the HTML document parsed into a tree, finding elements on the page becomes as simple as the following, using XPath:
scraper = Html5Scraper()
tree = scraper.buildHtmlTree('https://www.google.com')
elements = tree.xpath("//div[@id='mngb']")  # xpath() returns a list of matching elements
[WebDrivers] Run in Headless or Disable Images
PhantomJSDriver is a Selenium driver that emulates a headless (i.e., windowless) browser. It is particularly useful in cases where only a terminal is available (and no screen-emulation software is installed). Because it is headless there is no overhead from rendering graphics or loading images, so unless you are doing UI testing and need to take screenshots of what webpages look like, it is unlikely that you absolutely need a visual browser.
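For reference, here is a minimal sketch of spinning up PhantomJSDriver from Python, assuming the phantomjs binary is installed and on your PATH (the URL is just an example):

from selenium import webdriver

driver = webdriver.PhantomJS()          # headless browser; requires the phantomjs executable
driver.get('https://www.google.com')    # loads the page with no window and no rendering overhead
print(driver.title)
driver.quit()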
However, that said, PhantomJSDriver is not without its faults. This particular Webdriver is not of the IE/Firefox/Chrome variety, and therefore there may be compatibility issues when visiting some sites. When PhantomJSDriver can’t parse a website correctly or otherwise just doesn’t do the trick, more common browsers such as Chrome (via ChromeDriver) may handle that site well. Chrome runs with a visible window, but you can still turn a few things off – image loading included – to improve its performance. Below is sample Python code showing how to set up ChromeDriver with image loading disabled, plus a few other settings turned off:
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
preferences = {"profile.managed_default_content_settings.images": 2}  # 2 = block images
chrome_options.add_experimental_option("prefs", preferences)
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--disable-logging")
chrome_options.add_argument("--no-proxy-server")
chrome_options.add_argument("--disable-client-side-phishing-detection")
chrome_options.add_argument("--disable-sync")
chrome_options.add_argument("--disable-component-update")
chrome_options.add_argument("--disable-default-apps")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("--disable-web-security")
chrome_options.add_argument("--safebrowsing-disable-auto-update")

driver = webdriver.Chrome(chrome_options=chrome_options)
Refer to this page for information about each of the above arguments.
Search For Elements in ID -> Name -> Class -> CSS/XPath Order
A well-structured website should have most of its important elements tagged with a unique ID, so it is best to locate an element by its ID, then by its name, then by its class, and lastly by its other attributes. The further away from IDs you stray when searching for an element, the more brittle your program gets, and you can spend a lot of time debugging whether the element you picked is the one you are actually looking for. In addition, even a small change to the website’s UI can cause your program to fail to find the correct element.
Assuming “driver” is the object holding your webdriver, elements may be searched using the following sample methods:
driver.find_element_by_id("id")
driver.find_element_by_name("user")
driver.find_element_by_class_name("class")
driver.find_element_by_css_selector("#root > div > div.css")
driver.find_element_by_xpath("//div[@style='color: red']")
Refer to this page for information about locating elements. There are other methods to search for elements, and I encourage you to figure out which is best for your situation. Also, there are separate methods for selecting multiple elements that fit a given criterion – this is useful for parsing information out of repeated structures such as tables and lists, as shown in the sketch below.
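For example, here is a rough sketch of pulling every row out of a table with the plural find_elements_* methods, continuing with the driver from the ChromeDriver example above (the CSS selector is hypothetical):

rows = driver.find_elements_by_css_selector("table.results tr")  # returns a list; empty if nothing matches
for row in rows:
    cells = row.find_elements_by_tag_name("td")
    print([cell.text for cell in cells])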
Don’t Click if You Can “GET”
Webdriver users can get into the habit of clicking a link to move between pages. The advantage of clicking is that you do not have to worry about how to build the GET or POST request, which saves a lot of time when you want a quick-and-easy solution. However, clicking an element adds risk and overhead: changes to the element may cause the program to fail to find it, and the browser may run unnecessary JavaScript or other tasks when the link is clicked.
If it is simply a matter of moving to another page and nothing else, then GET-ing the URL directly is quicker and less error-prone. For POSTs, inspecting the network tab of your browser’s developer tools while submitting the form will show you which form parameters to include.
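As a rough sketch (the URLs and form field names below are made up for illustration), the same navigation without any clicking might look like this:

import requests

session = requests.Session()

# Moving to another page is just a GET on its URL.
page = session.get('https://example.com/members/profile')

# For a form submission, copy the field names seen in the network inspector into the POST payload.
form_data = {'query': 'selenium', 'page': '1'}
result = session.post('https://example.com/search', data=form_data)
print(result.status_code)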
requests.Session() to Store Cookies
Python’s Requests API, in its simplest use case, can be thought of as Python’s answer to cURL. Each call to requests.get() retrieves the source of the HTML document but discards any session variables and cookies in the response. Fortunately, Requests also supports sessions, which exist for exactly this purpose! The following is a simpler version of the code provided in the “Python Libraries for Parsing HTML Documents” section to illustrate how sessions work – this is especially useful when scraping sites that require users to log in:
import requests
session = requests.Session()
r = session.get('https://www.google.com')
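To sketch the login scenario (the URL and form fields here are placeholders for whatever the site’s login form actually submits), the cookies set by the login response are stored on the session and sent automatically with later requests:

import requests

session = requests.Session()
login_payload = {'username': 'me', 'password': 'secret'}       # placeholder form fields
session.post('https://example.com/login', data=login_payload)  # response cookies are kept on the session

# Subsequent requests reuse the stored cookies, so pages behind the login are reachable.
profile = session.get('https://example.com/account')
print(profile.status_code)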