The callback in Scrapy seems to do nothing. I have read the related questions on SegmentFault but could not solve this, and I hope someone can help.

Problem description

I crawl an Amazon product listing and save the data into MongoDB.
The spider crawls the first page and passes the next-page link to Request; in scrapy shell the next-page link can be extracted correctly.
However, only the first page of data ever appears in the database.
The log shows that after the first page is crawled, the next-page link is requested, but no data is scraped from it.


Platform and what I have tried

Linux, MongoDB.
I tried adding dont_filter=True to the Request.
It did not help, and the spider crawled some unrelated pages.
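
A note on dont_filter, in case it helps: in Scrapy it bypasses both the scheduler's duplicate filter and the OffsiteMiddleware domain check, which is likely why unrelated pages got crawled. A minimal sketch of the difference, inside a spider callback:

    # Inside a spider callback such as parse(); the two yields below are
    # alternatives. A plain Request is deduplicated and dropped if it
    # points outside allowed_domains...
    yield Request(url=url, callback=self.parse)
    # ...while dont_filter=True skips both checks, so stray off-site
    # links (ads, redirects, ...) are followed as well.
    yield Request(url=url, callback=self.parse, dont_filter=True)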

related codes


spider.py

from scrapy import Request, Spider
from amazon.items import AmazonItem


class AmazonSpider(Spider):

    name = "book"
    allowed_domains = ["amazon.com"]
    start_urls = ["https://www.amazon.com/s/ref=lp_2649512011_il_ti_movies-tv?rh=n%3A2625373011%2Cn%3A%212625374011%2Cn%3A2649512011&ie=UTF8&qid=1533351160&lo=movies-tv"]

    def parse(self, response):
        result = response.xpath('//div[@id="mainResults"]/ul/li')
        # print(result)
        for it in result:
            item = AmazonItem()
            item["title"] = it.css("h2::text").extract_first()
            item["price"] = it.css(".a-link-normal.a-text-normal .a-offscreen::text").extract_first()
            yield item

        # ::attr() takes a bare attribute name, and response.urljoin()
        # takes a single (possibly relative) URL.
        next_page = response.css("#bottomBar #pagn #pagnNextLink::attr(href)").extract_first()
        if next_page:
            yield Request(url=response.urljoin(next_page), callback=self.parse, dont_filter=True)
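
Since the next-page link can already be extracted in the shell, it is worth verifying the whole chain there too; a quick check (start scrapy shell with the listing URL from start_urls, abbreviated in the comment below):

    # Inside a scrapy shell session started with the listing URL:
    #   scrapy shell "https://www.amazon.com/s/ref=lp_2649512011_..."
    next_page = response.css("#bottomBar #pagn #pagnNextLink::attr(href)").extract_first()
    print(next_page)  # relative href of the next page, or None if nothing matched
    if next_page:
        print(response.urljoin(next_page))  # absolute URL; urljoin takes one argument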

pipelines.py:

import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DB"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # item.collection is expected to be defined on AmazonItem;
        # insert_one() replaces the deprecated insert().
        self.db[item.collection].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
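
For completeness: the pipeline only runs if it is enabled in settings.py. A minimal sketch, assuming the project package is named amazon (as the imports suggest); the MONGO_* values are examples:

    # settings.py -- a minimal sketch; adjust the values to your setup
    MONGO_URI = "mongodb://localhost:27017"
    MONGO_DB = "amazon"

    ITEM_PIPELINES = {
        "amazon.pipelines.MongoPipeline": 300,
    }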

What result do you expect? What error do you actually see?

I expect the Request to call back into parse with the URL it was given, and to crawl the next page's content.
In the command-line output I can see that after the first page is crawled the next-page link is requested, but no data is scraped from it.
Copied into a browser, the link opens fine and shows page 2, 3, 4 and so on.
But I don't know why no data is being extracted.

Any advice would be much appreciated. Thank you!

Apr.05,2021

To give you some ideas: I ran into the same thing. Yesterday I opened the second page again and checked it with F12, and it turned out the page structure changes from the second page onwards, so the original extraction rules no longer match. I redefined the rules for pages after the first, and then the problem was solved.
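
A minimal sketch of that idea: fall back to a second set of selectors when the first-page ones match nothing. The fallback selector below is a placeholder; inspect the real page with F12 to find the actual one:

    # Inside AmazonSpider.parse(); fall back when the first-page
    # selectors match nothing. The second selector is a placeholder --
    # replace it with whatever F12 shows on page 2.
    result = response.xpath('//div[@id="mainResults"]/ul/li')
    if not result:
        result = response.css("div.s-result-list div.s-result-item")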
