How does a Scrapy Rule parse links in a JSON response?

rules = {
    "sina": (
        # Raw strings keep the regex backslashes intact; literal dots are escaped.
        Rule(LinkExtractor(allow=r"/\d+-\d+-\d+/.*?-.*?\.shtml", deny=(r"http://search\.sina\.com\.cn/.*?",)),
             callback="parse_item", follow=True),
    )
}

As shown above, the aim is to extract the matching links from the target page.
Example of the target page: https://feed.sina.com.cn/api/roll/get?pageid=121&lid=1356&num=20&versionNumber=1.2.4&page=1&encode=utf-8&callback=feedCardJsonpCallback&_=1545017197742
I have tried many regular expressions. The links appear in the response as "urls": "[\"https:\/\/news.sina.com.cn\/o\/2018-12-18\/doc-ihqhqcir7816653.shtml\"]". The expressions match these links when I test them on their own, but not inside a Scrapy Rule.

Feb.21,2022

For convenience, here is part of the spider source code first:

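from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SinacrawlSpider(CrawlSpider):
    name = 'Sinacrawl'
    allowed_domains = ['sina.com.cn']
    start_urls = ['https://feed.sina.com.cn/api/roll/get?pageid=121&lid=1356&num=20&versionNumber=1.2.4&page=1&encode=utf-8&callback=feedCardJsonpCallback&_=1545017197742']

    rules = (
        Rule(LinkExtractor(allow=r'.*?\.shtml', deny=(r'http://search.sina.com.cn/.*?',)), callback='parse_item', follow=True),
        # Note: allow expects regular expressions; this value is a CSS selector string.
        Rule(LinkExtractor(allow='a[href^="http"]'), follow=True),
    )

    def parse_item(self, response):
        # SinacrawlItem comes from the project's items module (not shown here).
        item = SinacrawlItem()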

This is not a regular-expression problem. I tried LinkExtractor(allow=()) and the spider still never entered the parse_item function.
Looking at the Scrapy source code, in scrapy\spiders\crawl.py, line 56, there is the _requests_to_follow function:

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

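If the response is not of type HtmlResponse, the function returns immediately, so none of the rules are applied and no links are extracted.
The URL in the question does not return an HTML page (note the callback=feedCardJsonpCallback parameter), so the rules never run against it.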
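One possible workaround is to decode the response yourself instead of relying on the rules. The following is only a minimal sketch using the standard library: it assumes the body is a JSONP call wrapping a JSON object and that article links sit under "urls" keys as JSON-encoded lists, as in the snippet quoted in the question. The helper names (load_jsonp, iter_urls, extract_article_urls) and the "result"/"data" nesting in the sample are made up for illustration.

```python
import json
import re

# Same idea as the pattern in the question: a date segment followed by *.shtml
ARTICLE_RE = re.compile(r"/\d+-\d+-\d+/.*?-.*?\.shtml")


def load_jsonp(text):
    """Decode the JSON value that follows the first "(" of a JSONP body."""
    start = text.index("(") + 1
    # raw_decode does not skip leading whitespace, so do it by hand
    while text[start].isspace():
        start += 1
    # raw_decode ignores whatever trails the JSON value, e.g. ");"
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


def iter_urls(node):
    """Yield every URL stored under a "urls" key anywhere in the payload."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "urls":
                # Assumption: the value is a JSON-encoded list (or already a list)
                urls = json.loads(value) if isinstance(value, str) else value
                for url in urls:
                    yield url.strip()
            else:
                yield from iter_urls(value)
    elif isinstance(node, list):
        for child in node:
            yield from iter_urls(child)


def extract_article_urls(jsonp_text):
    """Return the URLs in a JSONP body that look like article pages."""
    payload = load_jsonp(jsonp_text)
    return [url for url in iter_urls(payload) if ARTICLE_RE.search(url)]


# Hypothetical sample body; the real feed layout may differ.
inner = json.dumps(["https://news.sina.com.cn/o/2018-12-18/doc-ihqhqcir7816653.shtml"])
sample = "feedCardJsonpCallback(" + json.dumps({"result": {"data": [{"urls": inner}]}}) + ");"
print(extract_article_urls(sample))
```

In a spider, a function like this could be called on response.text in the callback for the feed URL, yielding a scrapy.Request(url, callback=self.parse_item) for each returned link. Since the rules are bypassed anyway, a plain scrapy.Spider is probably enough for the feed request.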
