Python crawler encounters "Javascript is required" when crawling web pages

When crawling a web page, I cannot get the page source; instead, the response contains <noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>

I searched the forum for this question, and it seems I'm the only one with this problem.

Here is my code:

import requests

def get_page(page):
    url = "http://cambb.cc/forum.php?"
    data = {
        "mod": "forumdisplay",
        "fid": "37",
        "filter": "",
        "orderby": "lastpost",
        "page": page,
        "t": "1892855",
    }
    headers = {
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        "Accept": "*/*",
        "Referer": "http://cambb.cc/forum.php?mod=forumdisplay&fid=37",
        "X-Requested-With": "XMLHttpRequest",
        "Connection": "keep-alive",
        "Host": "cambb.cc",
        "Cookie": "prhF_2132_saltkey=AzXzRRx7; prhF_2132_lastvisit=1541264796; prhF_2132_nofavfid=1; prhF_2132_smile=1D1; sucuri_cloudproxy_uuid_79c74ae60=0b002e4f44799010d471d1ac30792e43; prhF_2132_auth=d52cMREMy0bxoLJZpaewtEdQ5OoPl%2FMq7ObQuI3%2B%2FtX9wT3KnvTSpZ%2BLHVYBg63fBnzztNHCgpudNYlBodYOPRfY; prhF_2132_lastcheckfeed=5862%7C1541315119; prhF_2132_home_diymode=1; prhF_2132_visitedfid=37D40D2D36; vClickLastTime=a%3A4%3A%7Bi%3A0%3Bb%3A0%3Bi%3A2414%3Bi%3A1541260800%3Bi%3A2459%3Bi%3A1541260800%3Bi%3A2433%3Bi%3A1541260800%3B%7D; prhF_2132_st_p=5862%7C1541318034%7Ce7c1eea8356bda291aed74ccdb20537d; prhF_2132_viewid=tid_2303; prhF_2132_sid=zHIuFg; prhF_2132_lip=182.148.204.234%2C1541317606; prhF_2132_ulastactivity=2369H6awa8fQve9%2BQr8p8MzIVU3ieAbL9rm7Idao%2FRlbwrDEtV2S; prhF_2132_checkpm=1; prhF_2132_sendmail=1; prhF_2132_st_t=5862%7C1541339179%7C6fbad6f782ca1ec63109b566edb0888c; prhF_2132_forum_lastvisit=D_36_1541269244D_40_1541315980D_37_1541339179; prhF_2132_lastact=1541339180%09misc.php%09patch",
    }
    response = requests.get(url, params=data, headers=headers)
    print(response.text)

The error output is shown in the screenshot below:

[screenshot not reproduced]

What is the cause of this problem? Is it necessary to simulate a browser and load the JavaScript first? Could you help explain the principle behind it? Thank you very much.

Oct. 19, 2021

This is not an error, and even if it were, it would not be the fault of the crawler itself. Also, don't post screenshots when the error log can be copied as text.
That you consider this an error and cannot solve it on your own suggests you are unfamiliar with the front end. First look up what the noscript tag is, then figure out what is in the script tag that follows it.
Now to answer your question. The role of the noscript tag was mentioned above: its main purpose is not anti-crawling, but to give the user a friendly prompt instead of a blank screen when JS is disabled. Notice that there is a script tag after the noscript tag, which means the rest of the page is loaded by that JS; if the crawler doesn't execute the JS, it obviously cannot capture the content you want.
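For example, here is a minimal sketch for confirming that situation and pulling out the inline script bodies for inspection (the helper names looks_js_gated and extract_inline_scripts are my own, not from the question):

import re
import requests

def looks_js_gated(html):
    # Heuristic: the server returned the "Javascript is required" page
    # instead of the real content.
    lowered = html.lower()
    return "<noscript>" in lowered and "javascript is required" in lowered

def extract_inline_scripts(html):
    # Pull out inline <script> bodies so you can read what the loader JS does.
    return re.findall(r"<script[^>]*>(.*?)</script>", html, re.S | re.I)

resp = requests.get("http://cambb.cc/forum.php?mod=forumdisplay&fid=37")
if looks_js_gated(resp.text):
    for script in extract_inline_scripts(resp.text):
        print(script[:200])  # first 200 chars of each script, for inspection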
So if you don't want to add a browser dependency to your crawler, you can start with this JS. Content-loading JS like this generally takes one of two approaches to load the rest of the content (see the sketch after the list):

  1. The content itself is already embedded in this JS, and the JS is only responsible for decrypting/decoding it.
  2. The JS sends an AJAX request to load the remaining content, and is responsible for rendering and/or decrypting what the request returns.
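A rough sketch of both patterns in Python follows. The variable name "payload", the base64 encoding, and the "ajax" parameter are all hypothetical placeholders; the real ones have to be read out of the site's actual JS or the browser's network panel:

import base64
import re
import requests

html = requests.get("http://cambb.cc/forum.php?mod=forumdisplay&fid=37").text

# Pattern 1: the content is embedded in the JS (here, a hypothetical base64
# blob assigned to a variable called "payload") and merely decoded
# client-side. Find the blob and decode it yourself:
m = re.search(r'payload\s*=\s*"([^"]+)"', html)  # hypothetical variable name
if m:
    print(base64.b64decode(m.group(1)).decode("utf-8", errors="replace"))

# Pattern 2: the JS fires an AJAX request for the real content. Watch the
# browser's network panel to find that request, then replay it directly.
# The "ajax" parameter below is a made-up example:
resp = requests.get(
    "http://cambb.cc/forum.php",
    params={"mod": "forumdisplay", "fid": "37", "ajax": "1"},  # hypothetical
    headers={"X-Requested-With": "XMLHttpRequest"},
)
print(resp.text)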

So all you have to do is figure out what the JS does, and then replicate that process in your crawler. Of course, most JS of this kind is obfuscated and/or minified, and studying it requires some grasp of JS. The OP may need to find someone to help with this.

If you don't mind adding a browser dependency to your crawler, you can use the headless mode of Chrome; search for a Python driver for headless Chrome.
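For instance, with Selenium driving headless Chrome (a sketch; it assumes Selenium and chromedriver are installed):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")    # run Chrome without opening a window
options.add_argument("--disable-gpu")

driver = webdriver.Chrome(options=options)
try:
    driver.get("http://cambb.cc/forum.php?mod=forumdisplay&fid=37")
    # Chrome executes the loader JS, so page_source now contains the
    # fully rendered page that plain requests could not see.
    print(driver.page_source)
finally:
    driver.quit()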


I haven't written crawlers myself, so take this just for reference: you can search with "noscript" and "anti-crawler" as keywords.
