How should a crawler organize the information it collects?

For example, I need to crawl the news and article pages of many websites and extract the title, content, publication time, and other information from each page. But every site uses a different page layout. Do I have to write a separate crawler for each site?
Also, once the information is captured, each website's data comes in a different format, and I need to convert it to the format my own website uses. Is there a single conversion method that works for all formats?


1. How to crawl articles from multiple websites?
Answer: Different websites have different HTML structures and pagination schemes, so you have to write a separate parser for each site. The fetching logic can be shared; only the parsing rules differ from site to site.
2. How to organize crawler information?
Answer: You already know which fields you want to crawl, such as title, content, and author. Each one is just a key-value pair, so have every parser produce the same set of keys and store each value in the corresponding field of the database.
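
The two answers above can be sketched together as a minimal, standard-library-only example. Everything specific in it is an assumption made for illustration: the host names, the regular expressions, the date formats, and the `articles` table layout are hypothetical, and a real crawler would more likely use a proper HTML parsing library than regexes. The point is only the shape: one rule set per site, one shared record format, one database table.

```python
import re
import sqlite3
from datetime import datetime
from urllib.parse import urlparse

# Hypothetical per-site rules: one regex per field plus the site's date format.
# Supporting a new site means adding an entry here, not writing a new crawler.
SITE_RULES = {
    "news.example.com": {
        "title": r"<h1[^>]*>(.*?)</h1>",
        "content": r'<div class="article-body">(.*?)</div>',
        "published": r'<time datetime="([^"]+)"',
        "date_format": "%Y-%m-%dT%H:%M:%S",
    },
    "blog.example.org": {
        "title": r'<h2 class="post-title">(.*?)</h2>',
        "content": r'<section id="post">(.*?)</section>',
        "published": r'<span class="date">(.*?)</span>',
        "date_format": "%d/%m/%Y",
    },
}


def strip_tags(fragment):
    """Drop HTML tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", fragment)
    return re.sub(r"\s+", " ", text).strip()


def parse_article(url, html):
    """Return a record with the same keys no matter which site it came from."""
    rules = SITE_RULES[urlparse(url).netloc]  # KeyError: no rules for this site yet
    record = {"url": url}
    for field in ("title", "content", "published"):
        match = re.search(rules[field], html, re.DOTALL)
        record[field] = strip_tags(match.group(1)) if match else None
    if record["published"]:
        # Normalize every site's date format to ISO 8601.
        parsed = datetime.strptime(record["published"], rules["date_format"])
        record["published"] = parsed.isoformat()
    return record


def save_article(conn, record):
    """Store each value in the column named after its key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(url TEXT PRIMARY KEY, title TEXT, content TEXT, published TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO articles "
        "VALUES (:url, :title, :content, :published)",
        record,
    )
    conn.commit()
```

A caller would fetch a page, pass the URL and HTML to `parse_article`, and hand the result to `save_article` with an open `sqlite3` connection. Because every record has the same keys, the conversion to your own site's format only has to be written once, against that shared record.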
