How can a Scrapy crawler check whether the configured IP proxy is actually working?

Hello!

Questions:

  1. I can inspect the headers of the current request through response.request.headers (the User-Agent is randomized), but I want to determine whether the configured IP proxy is actually in effect. How can I find out which IP was used for the current request?
  2. Generally speaking, if both the User-Agent and the IP address change, the CAPTCHA page should not appear, right? I considered cookies as the cause, but even with cookies disabled the CAPTCHA still appears and stops my crawler, so I suspect the IP proxy is not being applied.

    # Middleware that fetches a proxy from the proxy provider's API
    import urllib.request

    class ProxyAPIMiddleware(object):

        def process_request(self, request, spider):
            # "ipurl" is a placeholder for the proxy API URL
            req = urllib.request.Request("ipurl")
            response = urllib.request.urlopen(req)
            # Build the proxy URL from the "ip:port" string returned by the API
            ip = "http://%s" % str(response.read(), "utf-8")
            request.meta["proxy"] = ip    # set the proxy for this request
            print(request.meta["proxy"])  # print the proxy obtained from the API
    

Runtime:

     .
     .
     .
     .
    2018-06-23 15:57:29 [scrapy.middleware] INFO: Enabled spider middlewares:
    ["scrapy.spidermiddlewares.httperror.HttpErrorMiddleware",
     "scrapy.spidermiddlewares.offsite.OffsiteMiddleware",
     "scrapy.spidermiddlewares.referer.RefererMiddleware",
     "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware",
     "scrapy.spidermiddlewares.depth.DepthMiddleware"]
    =================
    2018-06-23 15:57:30 [scrapy.middleware] INFO: Enabled item pipelines:
    ["soopat_patent.pipelines.SoopatPatentPipeline"]
    2018-06-23 15:57:30 [scrapy.core.engine] INFO: Spider opened
    2018-06-23 15:57:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-06-23 15:57:30 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1
    
    ip: http://122.230.248.127:4523
    
    http://122.230.248.127:4523
    
    2018-06-23 15:57:33 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.soopat.com/> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]
    User-Agent: Mozilla/5.0 (compatible; WOW64; MSIE 10.0; Windows NT 6.2)
    
    ip: http://60.172.68.112:4507
    
    http://60.172.68.112:4507
    2018-06-23 15:57:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.soopat.com/> (referer: http://www.soopat.com/)
    =========================
    User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10
    
    ip: http://140.255.4.142:4523
    
    http://140.255.4.142:4523
    
    2018-06-23 15:57:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET ....)
    
    
    ...
    []
    
    list index out of range
    2018-06-23 15:57:48 [scrapy.core.engine] INFO: Closing spider (finished)
    2018-06-23 15:57:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    
    .
    .
    .
    
    

The Scrapy crawler worked normally at first and the data was stored correctly, but when I ran it again the next day it was blocked by the CAPTCHA right away.
I'm new to crawlers and would appreciate any advice. Thank you.

Mar.21,2021

In general, you can send a request through the proxy to a third-party API that reports the IP address it sees, such as Taobao's IP location service, or you can host a public endpoint yourself that echoes the caller's IP.
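
A minimal sketch of that idea, using only the standard library. It assumes an echo endpoint that returns JSON of the form {"origin": "1.2.3.4"}; httpbin.org/ip is used here as one public example, and the helper names (fetch_echo, seen_ips, proxy_in_effect) are made up for illustration. The check compares what the server sees with and without the proxy, because a proxy's exit IP does not always match the address you connect to.

```python
import json
import urllib.request

# Assumption: an IP echo endpoint returning JSON like {"origin": "1.2.3.4"}.
# httpbin.org/ip is one public example; any equivalent service would do.
ECHO_URL = "http://httpbin.org/ip"


def fetch_echo(proxy_url=None, timeout=10):
    """Return the raw body of the echo endpoint, optionally via a proxy."""
    if proxy_url:
        handler = urllib.request.ProxyHandler({"http": proxy_url})
    else:
        # An empty mapping disables any proxy picked up from the environment.
        handler = urllib.request.ProxyHandler({})
    opener = urllib.request.build_opener(handler)
    with opener.open(ECHO_URL, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def seen_ips(echo_body):
    """Parse the echo response into the list of IPs the server saw."""
    origin = json.loads(echo_body)["origin"]
    # Some proxies forward the client address, giving "client, proxy".
    return [part.strip() for part in origin.split(",") if part.strip()]


def proxy_in_effect(direct_body, proxied_body):
    """True if the proxied request exposed none of the directly seen IPs."""
    direct = set(seen_ips(direct_body))
    proxied = set(seen_ips(proxied_body))
    return bool(proxied) and direct.isdisjoint(proxied)
```

To use it, call fetch_echo() once without a proxy and once with the proxy URL returned by the proxy API, then pass both bodies to proxy_in_effect. Inside Scrapy, the proxy assigned to a request can be read back in the callback with response.meta.get("proxy"), so the same comparison can be done by yielding a request to the echo URL and logging both values. The sketch only covers plain HTTP targets; an HTTPS target would need an "https" entry in the ProxyHandler mapping.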
