How do I scrape the content of the first page when paginating with CrawlSpider?

I use CrawlSpider with the following rules to follow the "next page" links automatically and scrape the movie information from the Douban Top 250 list:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//span[@class="next"]/a'),
         callback="parse_item", follow=True),
)

Because the information I want is shown directly on the list pages, I don't need to follow the links into each detail page.

But here is the problem. Even though callback sets the handler, the callback is only called for responses to the links that LinkExtractor extracts, so it first runs on the second page and the content of the first page is lost.

I searched online for other solutions, but most of them use two or more Rules (because they need to follow deeper URLs). This can be solved by writing the pagination code by hand with the basic Spider, but can it be solved with CrawlSpider? That looks a little more elegant.

Mar.12,2021
The default callback for the content of the first page is parse_start_url. You only need to override this method.


About LinkExtractor: what is written in it here is not a regular expression but an XPath that matches the link to the next page.
