The crawler ran into a special case.

URL: https://www.lagou.com/gongsi/


I want to extract the content under this tag: <div class="item_manager_content">

But the first one has no <p>, while all the others do. How do I deal with this situation?

Dec.04,2021

First of all, crawl as if there were no <p>. Assuming the crawled text of the segment is stored in content:

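    # Drop a leading <p> (3 characters) if present
    if content.startswith('<p>'):
        content = content[3:]
    # Drop a trailing </p> (4 characters) if present
    if content.endswith('</p>'):
        content = content[:-4]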

This kind of incomplete web page is really crappy. I recommend parsing it with BeautifulSoup and the html5lib parser. It has the best fault tolerance, although it is slower.
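A minimal sketch of the BeautifulSoup approach. The sample markup and the helper name extract_manager_texts are made up for illustration and are not taken from the real page. It assumes the bs4 and html5lib packages are installed, and falls back to Python's built-in html.parser if html5lib is missing. Reading the text of each div means it no longer matters whether a <p> is present:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup: the first block has no <p>, the second does.
SAMPLE_HTML = """
<div class="item_manager_content">First intro without a paragraph tag</div>
<div class="item_manager_content"><p>Second intro inside a paragraph</p></div>
"""


def extract_manager_texts(html):
    """Return the text of every div.item_manager_content, with or without <p>."""
    try:
        # html5lib is the most fault-tolerant parser, but also the slowest.
        soup = BeautifulSoup(html, "html5lib")
    except FeatureNotFound:
        # Fallback (an assumption of this sketch): the stdlib-backed parser.
        soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all("div", class_="item_manager_content")
    # get_text() collects the text of all descendants, so an inner <p> is ignored.
    return [block.get_text(" ", strip=True) for block in blocks]


texts = extract_manager_texts(SAMPLE_HTML)
print(texts)
```

Because get_text() works on the parsed tree rather than on the raw string, no manual slicing of tags is needed.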


Grab it uniformly as if there were no <p>, and then, if there is a <p> wrapped around the content, remove it.
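The "grab first, then unwrap" idea can be sketched as a small helper. The name strip_outer_p is hypothetical, and the sketch assumes the crawled content is a plain string that is either bare text or wrapped in exactly one <p>...</p> pair:

```python
import re

# Matches a string that is wrapped in one <p ...> ... </p> pair.
_OUTER_P = re.compile(r"^<p\b[^>]*>(.*)</p>$", re.DOTALL | re.IGNORECASE)


def strip_outer_p(content):
    """Remove a single wrapping <p>...</p> if there is one; otherwise return the text as-is."""
    text = content.strip()
    match = _OUTER_P.match(text)
    # Unwrap only when the inner part holds no further </p>,
    # so several sibling paragraphs are left untouched.
    if match and "</p>" not in match.group(1).lower():
        return match.group(1).strip()
    return text


print(strip_outer_p("<p>hello</p>"))
print(strip_outer_p("hello"))
```

Unlike fixed-length slicing, this also copes with surrounding whitespace and with attributes on the <p> tag.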