How do I scrape the first page when using CrawlSpider to follow pagination?

I use CrawlSpider with the following rules to follow the "next page" links automatically and scrape the movie information from the Douban Top 250:

rules = (
    # Single quotes outside so the inner double quotes in the XPath are valid.
    Rule(LinkExtractor(restrict_xpaths='//span[@class="next"]/a'),
         callback="parse_item", follow=True),
)

Because the information I want is on the list pages themselves, I don't need to follow the links into each movie's detail page.

But there is a problem. Even though the Rule sets a callback, that callback is only invoked for responses to links the LinkExtractor has extracted. The first such response is the second page, so the content of the first page is never parsed.

I searched for other solutions online, but most of them use two or more Rules (because they need to follow links into detail pages). I could solve this by writing the pagination code by hand with a basic Spider, but can it be done with CrawlSpider? It looks a little more elegant.

Web-crawler scrapy python

Mar.12,2021

The default callback for the first page (the responses to start_urls) is parse_start_url. You only need to override this method.

As for LinkExtractor: what you write in it is not a rule for parsing page content. It is the pattern that matches the URL of the next page.


