How do you extract the content you want when a crawler hits a page built almost entirely from <p> tags?

Main problem: the front-end code of the web page is very messy and consists almost entirely of <p> tags. Extracting content with a Python crawler is painful, and the elements are hard to locate with BeautifulSoup4. How should I handle a situation like this? Any guidance is appreciated.
URL: http://eshu.100xuexi.com/uplo.



my code:

import requests
from bs4 import BeautifulSoup

chapterurl = "http://eshu.100xuexi.com/uploads/ebook/e512edf6fac442fbafa2d23e8f2c8c22/mobile/epub/OEBPS/chap9.html"
response = requests.get(chapterurl)
print(response.status_code)
response.encoding = response.apparent_encoding
res = response.text
soup = BeautifulSoup(res, "lxml")
# Chapter title
chap = soup.find(class_="TocHref").get_text()
print(chap)
# Question-type headings
TiXings = soup.find_all(class_="TiXing")
for TiXing in TiXings:
    TiXing = TiXing.get_text().strip()
    print(TiXing)
# Question content (not yet extracted)

Thanks in advance to all the experts!


Parse it with regular expressions.
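
As a rough sketch of what that could look like, the snippet below runs a regular expression over a small hand-written HTML sample (not the real page) and captures each <p> tag's class together with its text. The sample markup, the `Question` class name, and the `extract_paragraphs` helper are assumptions for illustration; only `TiXing` and `PSplit` come from this thread.

```python
import re

# Hand-written sample; the real page's markup may differ.
sample_html = '''
<p class="TiXing">Multiple choice</p>
<p class="Question">1. Which <b>option</b> is correct?</p>
<p class="PSplit"></p>
'''

# Capture the class attribute and the inner HTML of every <p> tag.
# re.S lets "." match newlines, so multi-line paragraphs are matched too.
P_TAG = re.compile(r'<p\s+class="([^"]*)"[^>]*>(.*?)</p>', re.S)


def extract_paragraphs(html):
    """Return (class, text) pairs, with nested tags stripped from the text."""
    pairs = []
    for css_class, inner in P_TAG.findall(html):
        text = re.sub(r'<[^>]+>', '', inner).strip()
        pairs.append((css_class, text))
    return pairs


paragraphs = extract_paragraphs(sample_html)
for css_class, text in paragraphs:
    print(css_class, repr(text))
```

This pattern only matches <p> tags whose first attribute is `class`, so it would need adjusting if the real markup is laid out differently.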


Personally, I think the only option is to find the pattern. You can select all the p tags directly, slice the list into the multiple-choice section and the analysis-question section, and then extract the fields according to the tag pattern.
Take the multiple-choice questions as an example: each single-choice or multiple-choice question consists of 8 p tags, and the questions are separated by an empty <p class="PSplit"> tag. The code is as follows:

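# Helper: split a list evenly into n segments
def seg_list(l, n):
    """
    Split a list into n consecutive segments of equal size.

    :param l: List, the list to split
    :param n: number of segments
    :return: a list of n lists; leftover items are appended to the first segments
    """
    if len(l) < n:
        raise ValueError('The list has fewer than %s items and cannot be split.' % (n,))
    new_list = [[] for _ in range(n)]
    segment_num = 0
    remainder = 0
    segpoint = len(l) // n
    for num, key in enumerate(l, 1):
        if segment_num < n:
            new_list[segment_num].append(key)
            # Move on to the next segment once this one is full
            if num % segpoint == 0:
                segment_num += 1
        else:
            # Distribute leftover items over the first segments
            new_list[remainder].append(key)
            remainder += 1
    return new_list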

Then write the crawler, normalize the structure, and extract the multiple-choice questions:

import requests
from bs4 import BeautifulSoup

url = 'http://eshu.100xuexi.com/uploads/ebook/e512edf6fac442fbafa2d23e8f2c8c22/mobile/epub/OEBPS/chap9.html'
res = requests.get(url)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text, 'lxml')
total_list = soup.select('p')
mcp = total_list[2:299]  # p tags of the multiple-choice section
mcp.insert(8, None)  # placeholder where a <p class="PSplit"> separator is missing
mcp.pop(144)  # drop one extra p tag so that every question spans exactly 9 p tags
result_list = seg_list(mcp, 33)  # split into 33 questions
for i in result_list:
    question = i[0].text
    choice_A = i[1].text
    choice_B = i[2].text
    choice_C = i[3].text
    choice_D = i[4].text
    answer = i[5].text
    test_point = i[6].text
    analyze = i[7].text
    print([question, choice_A, choice_B, choice_C, choice_D, answer, test_point, analyze])

The analysis questions also follow a pattern and can be extracted in the same way.
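
A possible alternative to the fixed slice indices, sketched here under assumptions: since the questions are separated by <p class="PSplit"> tags, the p tags can be grouped at each separator instead of being counted. The sample HTML and the `group_by_separator` helper below are made up for illustration, and I have not checked this against the real page, where separators may be missing in places.

```python
from bs4 import BeautifulSoup

# Hand-written sample standing in for the real page (an assumption).
sample_html = """
<p>1. First question</p><p>A. one</p><p>B. two</p>
<p class="PSplit"></p>
<p>2. Second question</p><p>A. three</p><p>B. four</p>
<p class="PSplit"></p>
"""


def group_by_separator(p_tags, separator_class="PSplit"):
    """Split a flat list of <p> tags into groups at each separator tag."""
    groups, current = [], []
    for tag in p_tags:
        # BeautifulSoup returns the class attribute as a list, or None if absent.
        if separator_class in (tag.get("class") or []):
            if current:
                groups.append(current)
            current = []
        else:
            current.append(tag.get_text(strip=True))
    if current:  # keep a trailing group that has no closing separator
        groups.append(current)
    return groups


soup = BeautifulSoup(sample_html, "html.parser")
groups = group_by_separator(soup.find_all("p"))
for group in groups:
    print(group)
```

Groups built this way can have different lengths, so a question with an extra p tag does not shift the questions that follow it.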


Your problem is really about converting an HTML page into an Excel table, which can be done directly with an online converter, without a crawler.
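
If you would rather script the conversion than use an online tool, here is a minimal sketch using only the standard library. It assumes each question has already been reduced to a list of eight strings, like the lists printed in the answer above. The column names, the sample row, and the `write_questions_csv` helper are my own assumptions, not data from the real page.

```python
import csv
import os
import tempfile

# Column names mirror the fields printed in the earlier answer (an assumption).
HEADER = ["question", "choice_A", "choice_B", "choice_C", "choice_D",
          "answer", "test_point", "analysis"]


def write_questions_csv(rows, path):
    """Write one question per row to a CSV file that Excel can open."""
    # utf-8-sig writes a BOM so that Excel detects the file as UTF-8;
    # newline="" is what the csv module expects for files it writes.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)


# Made-up sample row, not data from the real page.
sample_rows = [["1. Sample question?", "A. one", "B. two", "C. three",
                "D. four", "Answer: A", "Test point: sample", "Analysis: sample"]]
csv_path = os.path.join(tempfile.gettempdir(), "questions.csv")
write_questions_csv(sample_rows, csv_path)
```

The resulting file opens directly in Excel, with one question per row.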


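# Regular expression approach
import re
# Extract the content of every p tag; `html` is the requests response for the page
content = re.findall(r'<p[^>]*>(.*?)</p>', html.content.decode('utf-8'), re.S)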