A question about deep crawling with Python Scrapy - Codes Helper - Programming Question Answer

After crawling the navigation menu, I want to keep crawling in depth into each URL found in the navigation, and then write all of the returned values together to an xlsx file.

# -*- coding: utf-8 -*-

import scrapy

from lagou.items import LagouItem


class LaGouSpider(scrapy.Spider):
    name = "lagou"
    start_urls = ["https://www.lagou.com/"]
    # NOTE: Host, Referer and Cookie below were captured from
    # onlinelibrary.wiley.com, not www.lagou.com. A mismatched Host
    # header may cause the follow-up requests to fail.
    headers = {
        "Host": "onlinelibrary.wiley.com",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "Accept-Encoding": "gzip, deflate",
        "Referer": "http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1521-3773",
        "Cookie": 'EuCookie="this site uses cookies"; __utma=235730399.1295424692.1421928359.1447763419.1447815829.20; s_fid=2945BB418F8B3FEE-1902CCBEDBBA7EA2; __atuvc=0%7C37%2C0%7C38%2C0%7C39%2C0%7C40%2C3%7C41; __gads=ID=44b4ae1ff8e30f86:T=1423626648:S=ALNI_MalhqbGv303qnu14HBk1HfhJIDrfQ; __utmz=235730399.1447763419.19.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; TrackJS=c428ef97-432b-443e-bdfe-0880dcf38417; OLProdServerID=1026; JSESSIONID=441E57608CA4A81DFA82F4C7432B400F.f03t02; WOLSIGNATURE=7f89d4e4-d588-49a2-9f19-26490ac3cdd3; REPORTINGWOLSIGNATURE=7306160150857908530; __utmc=235730399; s_vnum=1450355421193%26vn%3D2; s_cc=true; __utmb=235730399.3.10.1447815829; __utmt=1; s_invisit=true; s_visit=1; s_prevChannel=JOURNALS; s_prevProp1=TITLE_HOME; s_prevProp2=TITLE_HOME',
        "Connection": "keep-alive",
    }

    def parse(self, response):
        mainNavs = response.xpath('//*[@class="menu_sub dn"]//dl')

        for content in mainNavs:
            item = LagouItem()
            # Inside each mainNavs node use the relative path ".//dt";
            # a bare "//dt" would search the whole document instead.
            item["nav"] = content.xpath(".//dt//span//text()").extract_first()
            nav_title = content.xpath(".//dd//a")

            for nav in nav_title:
                item["url"] = nav.xpath(".//@href").extract_first()
                item["title"] = nav.xpath(".//text()").extract_first()
                # if item["url"] is not None:
                #     yield item

                # Follow each navigation link and handle the page in load_url.
                request = scrapy.http.Request(
                    item["url"], headers=self.headers, callback=self.load_url
                )
                yield request

    def load_url(self, response):
        aaa = response.xpath("//title/text()").extract_first()
        print(aaa)  # ??
        print("----------------------")

Scrapy python-crawler

Mar.04,2021

Scrapy scheduled task under CentOS cannot be executed
When the scheduled task runs after entering the project directory, the error shows scrapy: command not found, but scrapy itself can be run, and the scrapy crawl test crawler command can also be executed on its own; only the scheduled command reports scrapy:command not found ...

Crontab scrapy python-crawler

Mar.04,2021
Retrying requests with scrapy RetryMiddleware while carrying the request headers and proxy IP
Goal: when a request fails because of its IP, or when a CAPTCHA is encountered, re-send the current request repeatedly until it succeeds, so as to reduce missed data in the crawl. Question: I don't know if my thinking is correct. At pres...

Scrapy python-crawler

Mar.23,2021
Can we set a proxy for the spider using the scrapy_splash?
When I implemented a spider using Scrapy, I wanted to change its proxy so that the server wouldn't block my requests because of frequent requests from one IP. I also knew how to change the proxy with Scrapy, using middlewares or directly cha...

Scrapy python-crawler

Mar.30,2021
How does scrapy crawl the content under a style="display:none" tag, where the display style of the web page element is set to invisible?
As in the title, a scrapy novice asks how to crawl the content under a style="display:none" tag, where the display style of the web element is set to invisible. The source code of the web page is as follows: <dl class="xxx" style=&qu...

Selenium scrapy python-crawler

Sep.24,2021
Using Scrapy-Redis for distributed crawlers: how to gracefully keep the scheduling pool able to feed multiple machines at the same time? Why does the scheduling pool empty so easily?
Question: RedisCrawlSpider's crawler template is used in the project to achieve two-way crawling, that is, one Rule handles horizontal crawling of next-page URLs, and one Rule handles vertical crawling of detail-page URLs. Then the effect of distributed ...

Scrapy python-crawler

May.12,2022
A question about a scrapy crawler, thank you, waiting online
Why is it that when I send a scrapy.Request to https://www.tianyancha.com/reportContent/24505794/2017 and then print the URL in the callback, it has become https://www.tianyancha.com/login?from=https://www.tianyancha.com/reportContent/24505794/2017...

Scrapy python-crawler python

Jun.20,2022
Scrapy reports ModuleNotFoundError: No module named 'pymongo' at runtime
Running the single file directly gives no import errors, and using mongodb in a standalone py file also works normally, but when I run it inside the scrapy project the import fails. Why? import json import pymongo from scrapy.utils.pr...

Mongodb python scrapy python-crawler

Jul.02,2022
Scrapy cannot extract the next page
Problem description: cannot get the next page. Related code: import scrapy from qsbk.items import QsbkItem from scrapy.http.response.html import HtmlResponse from scra...

Scrapy python-crawler

Jul.05,2022