My crawler has run into a special case. URL


I want to extract the content under this tag: <div class="item_manager_content">

The first one has no <p> inside it, but all the others do. How should I handle this?


First, grab the content as if there were no <p> tag. Then, assuming the scraped fragment is stored in content:

    # Drop a leading <p> (3 characters) and a trailing </p> (4 characters), if present
    if content.startswith('<p>'):
        content = content[3:]
    if content.endswith('</p>'):
        content = content[:-4]

This kind of malformed page is a real pain. I recommend parsing it with BeautifulSoup using the html5lib parser. It has the best fault tolerance, although it is slower.
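A minimal sketch of that suggestion, assuming the bs4 package is installed. The sample markup and the helper name extract_contents are made up for illustration, and the fallback to Python's built-in html.parser when html5lib is missing is my own addition, not part of the advice above. Taking the text of each div means it no longer matters whether the text sits inside a <p> or directly in the div.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample: the first block holds bare text, the second wraps it in <p>.
SAMPLE_HTML = """
<div class="item_manager_content">First manager bio</div>
<div class="item_manager_content"><p>Second manager bio</p></div>
"""


def extract_contents(html):
    """Return the text of every div.item_manager_content, with or without <p>."""
    try:
        # html5lib is the most lenient parser, but also the slowest.
        soup = BeautifulSoup(html, "html5lib")
    except FeatureNotFound:
        # Assumption: fall back to the stdlib parser if html5lib is not installed.
        soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all("div", class_="item_manager_content")
    # get_text() collects the text of all descendants, so the <p> wrapper is irrelevant.
    return [block.get_text(" ", strip=True) for block in blocks]


if __name__ == "__main__":
    for text in extract_contents(SAMPLE_HTML):
        print(text)
```

If html5lib is missing, it can be installed with pip install html5lib.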

Grab the content uniformly, as if there were no <p>, and then strip the <p> off if one turns out to wrap the result.
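A rough sketch of that grab-then-strip idea using only the standard library. The helper name unwrap_outer_p is hypothetical, and the regex assumes the fragment is wrapped by at most one <p>...</p> pair (attributes on the <p> are tolerated); fragments holding several paragraphs are returned unchanged.

```python
import re

# One <p ...> ... </p> pair wrapping the whole fragment; attributes are allowed.
_OUTER_P = re.compile(r"^\s*<p\b[^>]*>(.*?)</p>\s*$", re.IGNORECASE | re.DOTALL)


def unwrap_outer_p(content):
    """Strip a single wrapping <p>...</p> if present; otherwise just trim whitespace."""
    match = _OUTER_P.match(content)
    # Only unwrap when the inner part contains no further <p>, i.e. one paragraph.
    if match and not re.search(r"<p\b", match.group(1), re.IGNORECASE):
        return match.group(1).strip()
    return content.strip()


if __name__ == "__main__":
    print(unwrap_outer_p("<p>hello</p>"))
    print(unwrap_outer_p("hello"))
```

Both calls print hello, so the caller can treat the first block and the later ones the same way.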
