CSDN file crawler

Problem description

On CSDN, after logging in as a member, you normally click the download button on a file's page:

(screenshot: clipboard.png, showing the download button and its URL)

Clicking that button yields a URL, and a crawler used to be able to fetch the file directly from that URL. Recently, however, CSDN seems to have added some countermeasure: requesting the same URL no longer downloads the file, and the crawler gets back a 404 page instead.

Related code

The crawler uses Python's requests library. Below is the download function from the crawler class.

import os
import re

from bs4 import BeautifulSoup


def download(self, remote_url, local_dir):

    # 1. log in first if we have no session yet
    if not self.__is_logined:
        self.__login()

    # count this download attempt
    self.download_count += 1

    count = 0
    while count < 3:
        count += 1

        # 2. fetch the page and extract the real download URL
        html_text = self.__session.get(remote_url).text
        html = BeautifulSoup(html_text, "html5lib")
        link = html.find("a", id="vip_btn")
        if link is None:  # e.g. a 404/error page has no download button
            continue
        real_url = link.attrs["href"]

        # 3. request the file itself (stream so iter_content reads in chunks)
        source = self.__session.get(real_url, stream=True)

        # 3.1 extract the filename from the Content-Disposition header
        filename = re.findall(r'.*"(.*)"$', source.headers.get("Content-Disposition", '"None"'))[0]
        if filename == "None":
            continue
        filename = re.sub(r"\s", "_", filename)

        # 3.2 make sure the target directory exists
        if not os.path.exists(local_dir):
            os.makedirs(local_dir)
        _local_path = os.path.join(local_dir, filename)

        # 3.3 write the file to disk in chunks, closing the handle when done
        with open(_local_path, "wb") as local_file:
            for file_buffer in source.iter_content(chunk_size=512):
                if file_buffer:
                    local_file.write(file_buffer)
        return _local_path

    return None
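The filename extraction in step 3.1 can be exercised in isolation, which helps rule it out as the source of the failure. A minimal sketch, with a made-up header value (`filename_from_disposition` is a helper name introduced here, not part of the original class):

```python
import re


def filename_from_disposition(header_value):
    """Mirror step 3.1: pull the quoted filename out of a
    Content-Disposition header and replace whitespace with underscores."""
    matches = re.findall(r'.*"(.*)"$', header_value or '"None"')
    if not matches or matches[0] == "None":
        return None
    return re.sub(r"\s", "_", matches[0])


print(filename_from_disposition('attachment; filename="my report.pdf"'))  # my_report.pdf
print(filename_from_disposition(None))  # None (no header -> caller retries)
```

If this returns None for a response, the server did not send a usable Content-Disposition header, which is consistent with getting an error page instead of the file.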

Running the code above returns a 404 page instead of the file. How can the crawler download the file correctly?

Jun.06,2022

Check whether the request header is missing the cookie or other related parameters.
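One way to follow that advice is to attach browser-like headers and the logged-in cookie to the requests session before downloading, so every request the crawler sends matches what the browser sends. The header and cookie values below are placeholders copied-by-hand examples, not the actual values CSDN requires:

```python
import requests

session = requests.Session()

# Copy real values from a logged-in browser session (DevTools -> Network tab).
# Everything below is a placeholder.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://download.csdn.net/",
})
session.cookies.set("UserName", "your_username")  # placeholder cookie name/value

# Every request made through this session now carries the headers above, e.g.:
# session.get(remote_url)
print(session.headers["User-Agent"])
```

Comparing the headers the browser sends on a successful download against what the crawler sends is usually the fastest way to find which parameter the server is checking.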
