How do I crawl a website whose JS blocks crawlers?

For work, I need a program that regularly fetches the news posted at http://cnda.cfda.gov.cn/WS04/..

I tried PHP, Node.js, and C#, and all failed to crawl it. There is a JS script on the site that blocks backend crawlers: the crawl gets stuck on that JS, and the content after it never loads.

[screenshot: clipboard.png]

It's a challenging one; take a look.
PS: it must be crawled from the backend; just viewing the page in a browser is pointless.

Sep. 16, 2021

Use Selenium.
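A minimal headless-Chrome sketch with Selenium (Python). The URL and the fixed sleep are placeholders, and `--headless=new` assumes a recent Chrome; a `WebDriverWait` on a known element would be more robust than sleeping:

```python
import time

# Chrome flags for the headless crawl (module-level so they are easy to inspect)
HEADLESS_FLAGS = ["--headless=new", "--disable-gpu", "--no-sandbox"]

def fetch_rendered_html(url: str, wait_seconds: float = 5.0) -> str:
    """Load `url` in headless Chrome and return the HTML after JS has run."""
    # imported lazily so this module still loads where selenium is not installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    for flag in HEADLESS_FLAGS:
        opts.add_argument(flag)
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        # crude: give the loader / anti-crawler JS time to fetch the real content
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        driver.quit()

# usage (needs Chrome + chromedriver and network access):
# html = fetch_rendered_html("http://cnda.cfda.gov.cn/WS04/")
```

Since the page's JS actually executes in the browser, whatever content it loads or decrypts ends up in `page_source`.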


Found a related blog post; the protection can be bypassed from the app side:

https://my.oschina.net/hengba.

In this case, it is not a JS that prevents crawling, but a JS that is responsible for loading the rest of the content. Some frontend frameworks also load their content with JS; if you disable JS in your browser and reload the page, you should see the same (empty) result.

So you can either dig the site's API out of that JS, or decrypt the real content from it. The question to settle first is whether this JS loads (and decrypts) the remaining content from the server, or only decrypts content that is already embedded in the script itself.
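Digging the API out of the loader JS can be as simple as scanning its string literals for URL-like paths. A rough sketch; `SAMPLE_JS` is an invented stand-in, not the site's real script, and the regex is only a heuristic:

```python
import re

# hypothetical excerpt of a loader script (NOT the real site's code)
SAMPLE_JS = """
var listUrl = "/WS04/newsList.json?page=1";
xhr.open("GET", listUrl);
"""

def find_api_paths(js_source: str) -> list:
    """Pull string literals that look like API endpoints out of a JS loader."""
    # quoted absolute paths that end in .json or carry a query string
    pattern = r'["\'](/[^"\']*?(?:\.json|\?)[^"\']*)["\']'
    return re.findall(pattern, js_source)

print(find_api_paths(SAMPLE_JS))  # → ['/WS04/newsList.json?page=1']
```

If the content turns out to be embedded and encrypted inside the script instead, the same kind of inspection tells you which decryption routine to reimplement.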

Of course, if your crawl is small-scale, just drive a headless Chrome.


Find the API endpoint and fetch the content from it directly!


This website is nasty: a script on the page calls `debugger`, and it also downloads another JS that calls `debugger` too.

If you want to work on it directly in the browser console, you have to block both scripts.

First, right-click the script request in the Network panel and select "Block request URL":

[screenshots: Network panel, block-request context menu]
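Instead of fighting the `debugger` traps in the browser, a backend crawler can download each script and strip the statements before analyzing it. A minimal sketch; the regex is naive and would also touch the word `debugger` inside string literals:

```python
import re

def strip_debugger(js_source: str) -> str:
    """Remove bare `debugger` statements so breakpoint traps never fire."""
    return re.sub(r'\bdebugger\s*;?', '', js_source)

# a typical anti-debugging trap: fire `debugger` in a tight interval
trapped = 'setInterval(function(){debugger;}, 50);'
print(strip_debugger(trapped))  # → 'setInterval(function(){}, 50);'
```

The cleaned script can then be studied (or executed in a sandbox) without the traps constantly pausing execution.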