CSDN file crawler

Problem description

On CSDN, after logging in as a member, you normally click the download button on a file's page:

(screenshot: clipboard.png, showing the download button and its URL)

Clicking that button yields a URL, and a crawler used to be able to fetch the file directly from that URL. Recently, however, CSDN seems to have added some countermeasure: requesting the same URL no longer downloads the file, and the crawler gets back a 404 page instead.

Related code

The crawler uses Python's requests library. Below is the download function from the crawler class.

import os
import re

from bs4 import BeautifulSoup


def download(self, remote_url, local_dir):

    # 1. log in first if we have no session yet
    if not self.__is_logined:
        self.__login()

    # count this download attempt
    self.download_count += 1

    count = 0
    while count < 3:
        count += 1

        # 2. fetch the page and extract the real download URL
        html_text = self.__session.get(remote_url).text
        html = BeautifulSoup(html_text, "html5lib")
        link = html.find("a", id="vip_btn")
        if link is None:  # e.g. a 404/error page has no download button
            continue
        real_url = link.attrs["href"]

        # 3. request the file itself (stream so iter_content reads in chunks)
        source = self.__session.get(real_url, stream=True)

        # 3.1 extract the filename from the Content-Disposition header
        filename = re.findall(r'.*"(.*)"$', source.headers.get("Content-Disposition", '"None"'))[0]
        if filename == "None":
            continue
        filename = re.sub(r"\s", "_", filename)

        # 3.2 make sure the target directory exists
        if not os.path.exists(local_dir):
            os.makedirs(local_dir)
        _local_path = os.path.join(local_dir, filename)

        # 3.3 write the file to disk in chunks, closing the handle when done
        with open(_local_path, "wb") as local_file:
            for file_buffer in source.iter_content(chunk_size=512):
                if file_buffer:
                    local_file.write(file_buffer)
        return _local_path

    return None
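The filename extraction in step 3.1 can be exercised in isolation, which helps rule it out as the source of the failure. A minimal sketch, with a made-up header value (`filename_from_disposition` is a helper name introduced here, not part of the original class):

```python
import re


def filename_from_disposition(header_value):
    """Mirror step 3.1: pull the quoted filename out of a
    Content-Disposition header and replace whitespace with underscores."""
    matches = re.findall(r'.*"(.*)"$', header_value or '"None"')
    if not matches or matches[0] == "None":
        return None
    return re.sub(r"\s", "_", matches[0])


print(filename_from_disposition('attachment; filename="my report.pdf"'))  # my_report.pdf
print(filename_from_disposition(None))  # None (no header -> caller retries)
```

If this returns None for a response, the server did not send a usable Content-Disposition header, which is consistent with getting an error page instead of the file.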

Running the code above returns a 404 page instead of the file. How can the crawler download the file correctly?

Jun.06,2022

Check whether the request header is missing the cookie or other related parameters.
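One way to follow that advice is to attach browser-like headers and the logged-in cookie to the requests session before downloading, so every request the crawler sends matches what the browser sends. The header and cookie values below are placeholders copied-by-hand examples, not the actual values CSDN requires:

```python
import requests

session = requests.Session()

# Copy real values from a logged-in browser session (DevTools -> Network tab).
# Everything below is a placeholder.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://download.csdn.net/",
})
session.cookies.set("UserName", "your_username")  # placeholder cookie name/value

# Every request made through this session now carries the headers above, e.g.:
# session.get(remote_url)
print(session.headers["User-Agent"])
```

Comparing the headers the browser sends on a successful download against what the crawler sends is usually the fastest way to find which parameter the server is checking.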
