Question about customizing Scrapy's FilesPipeline to rename downloaded files

Because Scrapy's built-in FilesPipeline names each downloaded file after a hash of its URL, I wanted to customize my own pipeline to rename the files. After googling for a while, I found that everyone says the same thing: inherit from the FilesPipeline class and override the get_media_requests and file_path methods. The log from running my spider with such a pipeline is below.
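Roughly, the version everyone suggests looks like this (the exact filename logic here is just an example based on the resourceid parameter in my URLs; the try/except import guard only lets the naming logic run without Scrapy installed):

```python
from urllib.parse import parse_qs, urlparse

try:
    import scrapy
    from scrapy.pipelines.files import FilesPipeline
except ImportError:  # lets the file_path logic below run without Scrapy installed
    scrapy = None
    FilesPipeline = object


class MyfilesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # Request every URL collected in the item's file_urls field.
        for url in item.get("file_urls", []):
            yield scrapy.Request(url)

    def file_path(self, request, response=None, info=None):
        # Name the file after its resourceid query parameter instead of
        # the default hash of the URL.
        resourceid = parse_qs(urlparse(request.url).query).get(
            "resourceid", ["unnamed"])[0]
        return "files/%s" % resourceid
```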

2018-04-02 17:14:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-02 17:14:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-02 17:14:46 [scrapy.middleware] INFO: Enabled item pipelines:
['cszgSpider.pipelines.MyfilesPipeline',
 'cszgSpider.pipelines.CszgspiderPipeline']
2018-04-02 17:14:46 [scrapy.core.engine] INFO: Spider opened
2018-04-02 17:14:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-02 17:14:46 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-02 17:14:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://cishan.chinanpo.gov.cn/biz/ma/csmh/d/csmhddoSort.html?pageNo=1&search_condit=&sort=desc&flag=0> (referer: None)
2018-04-02 17:14:47 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cishan.chinanpo.gov.cn/biz/ma/csmh/d/csmhddoSort.html?pageNo=1&search_condit=&sort=desc&flag=0>
{'file_urls': ['http://cishan.chinanpo.gov.cn/mz/upload/pub/load/resource_download.html?resourceid=ff80808162718ac70162854c92690082',
               'http://cishan.chinanpo.gov.cn/mz/upload/pub/load/resource_download.html?resourceid=ff808081627192280162853fdb1501e1',
               'http://cishan.chinanpo.gov.cn/mz/upload/pub/load/resource_download.html?resourceid=ff80808162718ac701627b3fd553004f',
               'http://cishan.chinanpo.gov.cn/mz/upload/pub/load/resource_download.html?resourceid=ff80808162718ac7016275e8bcda0039',
               'http://cishan.chinanpo.gov.cn/mz/upload/pub/load/resource_download.html?resourceid=ff80808162718ac7016275d0c49e002d',
               'http://cishan.chinanpo.gov.cn/mz/upload/pub/load/resource_download.html?resourceid=ff80808162719228016275c014f700dc',
               'http://cishan.chinanpo.gov.cn/mz/upload/pub/load/resource_download.html?resourceid=ff80808162719228016275a2d33d00c6',
               'http://cishan.chinanpo.gov.cn/mz/upload/pub/load/resource_download.html?resourceid=ff8080816271922801627551a95d00af',
               'http://cishan.chinanpo.gov.cn/mz/upload/pub/load/resource_download.html?resourceid=ff80808162719228016274eb5c5300a3',
               'http://cishan.chinanpo.gov.cn/mz/upload/pub/load/resource_download.html?resourceid=ff8080816271922801627453688d000d'],
 'files': []}
2018-04-02 17:14:47 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-02 17:14:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 518,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 12444,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 2, 9, 14, 47, 11611),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 4, 2, 9, 14, 46, 433082)}
2018-04-02 17:14:47 [scrapy.core.engine] INFO: Spider closed (finished)

I have all the download links, but the download never seems to start. Why?

Feb. 28, 2021

Have you solved this problem? I have run into the same issue.


You are thinking about it the wrong way. `response=None` is the default for that parameter, so no response content is available at that point in the pipeline. If you want to name the file yourself, it is recommended to extract the name in the spider, save it in the request's meta, pass it along, read it back in the pipeline, and use it to build the path.
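A hedged sketch of that suggestion (the `file_names` item field and the `meta["filename"]` key are names made up for illustration; the import guard only lets the path logic run without Scrapy installed):

```python
try:
    import scrapy
    from scrapy.pipelines.files import FilesPipeline
except ImportError:  # lets the path logic below run without Scrapy installed
    scrapy = None
    FilesPipeline = object


class MetaRenamePipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # The spider puts the desired name for each URL into the item,
        # e.g. item = {"file_urls": [...], "file_names": ["report.xls", ...]}.
        for url, name in zip(item["file_urls"], item["file_names"]):
            yield scrapy.Request(url, meta={"filename": name})

    def file_path(self, request, response=None, info=None):
        # No response is needed here: the name travels with the request.
        return "files/" + request.meta["filename"]
```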


I also want to get the file name from the response headers, and I hit the same problem as the OP: `response` always seems to be None.
This is https://github.com/scrapy/scrapy/issues/4457

Anyway:

Search `scrapy.pipelines.files.FilesPipeline` for the `file_path` function; it is called in three places. Overriding it with a probe like this shows which calls carry a response:

def file_path(self, request, response=None, info=None):
    if response is None:
        return None  # some of the call sites pass no response
    return response.headers  # the call made after the download does

By stepping through with a debugger, you can confirm that one of those calls does receive a response.
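So one workaround, assuming the server sends a Content-Disposition header (which those resource_download.html URLs suggest), is to take the name from the response headers when a response is present. A sketch, with the header parsing kept in a plain helper:

```python
import re


def filename_from_content_disposition(header):
    """Pull a filename out of a Content-Disposition header value.

    Returns None when the header is absent or names no file.
    """
    if not header:
        return None
    match = re.search(r'filename="?([^";]+)"?', header)
    return match.group(1) if match else None
```

Inside an overridden `file_path`, when `response` is not None, one could try `filename_from_content_disposition(response.headers.get(b"Content-Disposition", b"").decode())` and fall back to `super().file_path(request, response=response, info=info)` when it returns None (Scrapy header values are bytes, hence the decode).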
