What features does a large web crawler need?

Recently I've wanted to use Node to write a crawler tool. On the one hand I want to practice Node.js, and on the other hand I think a crawler is a good project for improving front-end knowledge. But I don't have much work experience, and I've never written or used a crawler at work, so I'd like to ask the experts:

1. What should a large crawler look like?
2. What specific features does it need? (What are the requirements of a crawler?)
3. How do companies use crawlers?


What do you mean by "a good project for improving front-end knowledge"? Apart from reasoning backwards about how the front end builds the page and how it works, and then finding a way to crawl the data, I don't see how it improves front-end knowledge.

  1. The accuracy of a general crawler can't be 100%, so you need to clean the data after crawling.
  2. Anti-crawler measures may block your IP or account if you crawl too fast, so you need countermeasures such as a proxy IP pool and a cap on requests per unit time.
  3. Large crawlers may need to be distributed or multithreaded to improve throughput.
  4. You need to handle common anti-crawler measures and be comfortable reading some JS code (front-end projects are now mostly built with tools such as webpack, so the code is minified and obfuscated).
  5. For crawler development in Node.js, you can look at puppeteer (a minimal sketch follows this list).
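As a rough illustration of point 5 (and the rate-limiting idea in point 2), here is a minimal puppeteer sketch in TypeScript. The URLs and the ".title" selector are hypothetical placeholders, and the 2-second delay is an arbitrary example of a per-unit-time threshold, not a recommended value:

```typescript
import puppeteer from "puppeteer";

// Hypothetical list of pages to crawl.
const urls = [
  "https://example.com/page/1",
  "https://example.com/page/2",
];

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawl(): Promise<void> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  for (const url of urls) {
    await page.goto(url, { waitUntil: "networkidle2" });

    // Read rendered text out of the page context, which works even when
    // the page is assembled by minified/obfuscated JS (point 4 above).
    const titles = await page.$$eval(".title", (nodes) =>
      nodes.map((n) => n.textContent?.trim() ?? "")
    );
    console.log(url, titles);

    await sleep(2000); // crude rate limit: one page every 2 seconds (point 2)
  }

  await browser.close();
}

crawl().catch((err) => {
  console.error("crawl failed:", err);
  process.exit(1);
});
```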

If you really want to learn about front-end development, take a look at this article: "Talking about the evolution of front-end development technology from Vue.js".


Well, I don't know what the standard for a "large" crawler is. I currently maintain a crawler (mainly grabbing news articles) that captures about 100,000 (10W) posts a day, written in Python 3 + scrapy. The knowledge I think you need is roughly as follows:

  • 1. The HTTP and HTTPS protocols: what the parameters in the headers mean, and what the status codes mean.
  • 2. Being able to capture packets; sometimes app data is much easier to grab than web data.
  • 3. Being able to read basic JS and reverse-engineer JS code, because many anti-crawling measures are parameters generated by obfuscated JS; also being able to debug JS with breakpoints in Chrome.
  • 4. Robustness of the crawler code: after you send a URL request, the returned data is produced by someone else, so there is a lot of uncertainty (see the sketch below).
To add: it is said that some crawlers have to decompile an app's APK package, which is very difficult, and that advanced crawler developers also understand Java, Android, and so on.
Also, selenium, appium, and similar tools are really slow and not well suited to deployment or large-scale crawling (personal feeling).
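Points 1 and 4 above are mostly about defensive coding. As a rough sketch (in TypeScript, assuming Node 18+ with the built-in fetch, since the original question was about Node, whereas my own crawler is Python/scrapy), something like this shows the idea; the User-Agent string, timeout, and backoff values are arbitrary placeholders:

```typescript
// Defensive request handling: set explicit headers, check the status code,
// and retry with a timeout, because the response comes from someone else's
// server and can fail in many ways. The URL is a placeholder.
async function fetchWithRetry(url: string, retries = 3): Promise<string> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), 10_000); // 10s timeout
    try {
      const res = await fetch(url, {
        headers: { "User-Agent": "Mozilla/5.0 (crawler demo)" },
        signal: controller.signal,
      });
      if (res.status === 200) {
        return await res.text();
      }
      // 403/429 usually mean an anti-crawler rule fired; back off and retry.
      console.warn(`attempt ${attempt}: got status ${res.status} for ${url}`);
    } catch (err) {
      console.warn(`attempt ${attempt}: request failed`, err);
    } finally {
      clearTimeout(timer);
    }
    await new Promise((r) => setTimeout(r, 1000 * attempt)); // simple backoff
  }
  throw new Error(`giving up on ${url} after ${retries} attempts`);
}
```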

As for your third question: how do companies use crawlers?
I can't answer from my own experience, because I haven't worked at many companies, but I think it comes down to need: a company uses crawlers because a concrete requirement exists, and when a company needs certain data it will find a way to get it.
