Creating our spider

This is the code for our first spider. Save it in a file named MySpider.py under the spiders directory in your project:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # rules must be an iterable of Rule objects; note the trailing comma,
    # which makes this a one-element tuple. The callback names the method
    # that will process each followed page.
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_item'),
    )

    def parse_item(self, response):
        # Data would normally be extracted here with response.xpath()
        # or response.css(); for now we return an empty item.
        element = Item()
        return element

CrawlSpider provides a mechanism for following links that match certain patterns. In addition to the attributes inherited from the Spider class, this class introduces a new rules attribute, with which we tell the spider which links to follow and which callback should process each resulting page.
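The allow argument of a link extractor takes regular expressions that are matched against discovered URLs; only matching links are followed, and an empty allow (as in the spider above) lets every link through. The filtering idea can be sketched in plain Python; the patterns and URLs below are purely illustrative:

```python
import re

# Hypothetical allow patterns, as might be passed to LinkExtractor(allow=...)
allow_patterns = [r'/category/', r'/item\d+\.html']

def url_allowed(url, patterns):
    """Return True if the URL matches any allow pattern.
    An empty pattern list allows every URL, mirroring allow=()."""
    if not patterns:
        return True
    return any(re.search(p, url) for p in patterns)

urls = [
    'http://www.example.com/category/books',
    'http://www.example.com/item42.html',
    'http://www.example.com/about',
]

# Only the first two URLs match the allow patterns.
followed = [u for u in urls if url_allowed(u, allow_patterns)]
```

The real LinkExtractor does more than this (it parses anchor tags, canonicalizes URLs, and respects deny rules), but the allow check is regex matching of this kind.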