Python crawler encounters "Javascript is required" when crawling web pages

When crawling a web page, I cannot get the page source; instead, the response contains <noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>

I searched the forum for this question, and it seems I'm the only one with this problem.

Here is my code:

import requests

def get_page(page):
    url = "http://cambb.cc/forum.php?"
    data = {
        "mod": "forumdisplay",
        "fid": "37",
        "filter": "",
        "orderby": "lastpost",
        "page": page,
        "t": "1892855",
    }
    headers = {
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        "Accept": "*/*",
        "Referer": "http://cambb.cc/forum.php?mod=forumdisplay&fid=37",
        "X-Requested-With": "XMLHttpRequest",
        "Connection": "keep-alive",
        "Host": "cambb.cc",
        "Cookie": "prhF_2132_saltkey=AzXzRRx7; prhF_2132_lastvisit=1541264796; prhF_2132_nofavfid=1; prhF_2132_smile=1D1; sucuri_cloudproxy_uuid_79c74ae60=0b002e4f44799010d471d1ac30792e43; prhF_2132_auth=d52cMREMy0bxoLJZpaewtEdQ5OoPl%2FMq7ObQuI3%2B%2FtX9wT3KnvTSpZ%2BLHVYBg63fBnzztNHCgpudNYlBodYOPRfY; prhF_2132_lastcheckfeed=5862%7C1541315119; prhF_2132_home_diymode=1; prhF_2132_visitedfid=37D40D2D36; vClickLastTime=a%3A4%3A%7Bi%3A0%3Bb%3A0%3Bi%3A2414%3Bi%3A1541260800%3Bi%3A2459%3Bi%3A1541260800%3Bi%3A2433%3Bi%3A1541260800%3B%7D; prhF_2132_st_p=5862%7C1541318034%7Ce7c1eea8356bda291aed74ccdb20537d; prhF_2132_viewid=tid_2303; prhF_2132_sid=zHIuFg; prhF_2132_lip=182.148.204.234%2C1541317606; prhF_2132_ulastactivity=2369H6awa8fQve9%2BQr8p8MzIVU3ieAbL9rm7Idao%2FRlbwrDEtV2S; prhF_2132_checkpm=1; prhF_2132_sendmail=1; prhF_2132_st_t=5862%7C1541339179%7C6fbad6f782ca1ec63109b566edb0888c; prhF_2132_forum_lastvisit=D_36_1541269244D_40_1541315980D_37_1541339179; prhF_2132_lastact=1541339180%09misc.php%09patch",
    }
    response = requests.get(url, params=data, headers=headers)
    print(response.text)

The error output is shown in the screenshot below:

[screenshot not reproduced]

What is the cause of this problem? Is it necessary to simulate a browser and load the JavaScript first? Could you help explain the principle behind it? Thank you very much.

Oct. 19, 2021

This is not an error, and even if it were, it would not be the fault of the crawler itself. Also, don't post screenshots when the error log can be copied as text.
That you consider this an error and cannot solve it on your own suggests you are unfamiliar with the front end. First look up what the noscript tag is, then figure out what is in the script tag that follows it.
Now to answer your question. The role of the noscript tag was mentioned above: its main purpose is not anti-crawling, but to give the user a friendly prompt instead of a blank screen when JS is disabled. Notice that there is a script tag after the noscript tag, which means the rest of the page is loaded by that JS; if the crawler doesn't execute the JS, it obviously cannot capture the content you want.
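For example, here is a minimal sketch for confirming that situation and pulling out the inline script bodies for inspection (the helper names looks_js_gated and extract_inline_scripts are my own, not from the question):

import re
import requests

def looks_js_gated(html):
    # Heuristic: the server returned the "Javascript is required" page
    # instead of the real content.
    lowered = html.lower()
    return "<noscript>" in lowered and "javascript is required" in lowered

def extract_inline_scripts(html):
    # Pull out inline <script> bodies so you can read what the loader JS does.
    return re.findall(r"<script[^>]*>(.*?)</script>", html, re.S | re.I)

resp = requests.get("http://cambb.cc/forum.php?mod=forumdisplay&fid=37")
if looks_js_gated(resp.text):
    for script in extract_inline_scripts(resp.text):
        print(script[:200])  # first 200 chars of each script, for inspection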
So if you don't want to add a browser dependency to your crawler, you can start with this JS. Content-loading JS like this generally takes one of two approaches to load the rest of the content (see the sketch after the list):

  1. The content itself is already embedded in this JS, and the JS is only responsible for decrypting/decoding it.
  2. The JS sends an AJAX request to load the remaining content, and is responsible for rendering and/or decrypting what the request returns.
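A rough sketch of both patterns in Python follows. The variable name "payload", the base64 encoding, and the "ajax" parameter are all hypothetical placeholders; the real ones have to be read out of the site's actual JS or the browser's network panel:

import base64
import re
import requests

html = requests.get("http://cambb.cc/forum.php?mod=forumdisplay&fid=37").text

# Pattern 1: the content is embedded in the JS (here, a hypothetical base64
# blob assigned to a variable called "payload") and merely decoded
# client-side. Find the blob and decode it yourself:
m = re.search(r'payload\s*=\s*"([^"]+)"', html)  # hypothetical variable name
if m:
    print(base64.b64decode(m.group(1)).decode("utf-8", errors="replace"))

# Pattern 2: the JS fires an AJAX request for the real content. Watch the
# browser's network panel to find that request, then replay it directly.
# The "ajax" parameter below is a made-up example:
resp = requests.get(
    "http://cambb.cc/forum.php",
    params={"mod": "forumdisplay", "fid": "37", "ajax": "1"},  # hypothetical
    headers={"X-Requested-With": "XMLHttpRequest"},
)
print(resp.text)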

So all you have to do is figure out what the JS does, and then replicate that process in your crawler. Of course, most JS of this kind is obfuscated and/or minified, and studying it requires some grasp of JS. The OP may need to find someone to help with this.

If you don't mind adding a browser dependency to your crawler, you can use the headless mode of Chrome; search for a Python driver for headless Chrome.
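For instance, with Selenium driving headless Chrome (a sketch; it assumes Selenium and chromedriver are installed):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")    # run Chrome without opening a window
options.add_argument("--disable-gpu")

driver = webdriver.Chrome(options=options)
try:
    driver.get("http://cambb.cc/forum.php?mod=forumdisplay&fid=37")
    # Chrome executes the loader JS, so page_source now contains the
    # fully rendered page that plain requests could not see.
    print(driver.page_source)
finally:
    driver.quit()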


I haven't written crawlers myself, so take this just for reference: you can search with "noscript" and "anti-crawler" as keywords.
