In a real Scrapy project, do you always use the XPath support that comes with the framework, or do you also re-instantiate the document with etree.HTML when appropriate?

When crawling the target website, the GET request returns data in JSON format, so if I want to parse an HTML string contained in one of the sub-fields with XPath, I can't use response.xpath (or maybe there is another way I don't know about). Instead, I have to pull that sub-field out of response.text and then re-instantiate something that supports XPath. Is this the correct way to handle it in a real project? A sketch of the pattern I mean is shown below.
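For illustration, a minimal sketch of that pattern (the API URL and the content_html field name are hypothetical), parsing the JSON body and then running XPath on the embedded fragment with lxml:

import json

import scrapy
from lxml import etree


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/api/items"]  # placeholder URL

    def parse(self, response):
        data = json.loads(response.text)
        # "content_html" is a hypothetical field that holds an HTML fragment
        tree = etree.HTML(data["content_html"])
        for title in tree.xpath("//span/text()"):
            yield {"title": title}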

Jun.27,2022

Generally speaking, Scrapy's built-in XPath and CSS selectors are sufficient; no other HTML/XHTML parsers such as etree or bs4 are needed.

For JSON content, you can call json.loads() directly to parse it, for example:

import json

js = json.loads(response.text)  # response.body_as_unicode() is the older, deprecated equivalent
js['xxx']

Newer versions of Scrapy (2.2+) also come with a response.json() method, similar to the requests library.
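A minimal sketch, assuming Scrapy 2.2 or later where TextResponse.json() is available:

data = response.json()  # equivalent to json.loads(response.text)
data['xxx']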

Reference

https://docs.scrapy.org/en/la...
https://github.com/scrapy/scr...


HTML fragments obtained from JSON can be wrapped in a Selector from scrapy.selector and then parsed with XPath and CSS selectors:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'
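
Putting the two together, a sketch for the case in the question, where a hypothetical content_html field in the JSON carries the fragment:

>>> import json
>>> from scrapy.selector import Selector
>>> raw = '{"content_html": "<div><span>good</span></div>"}'
>>> fragment = json.loads(raw)['content_html']
>>> Selector(text=fragment).xpath('//span/text()').get()
'good'
>>> Selector(text=fragment).css('span::text').get()
'good'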

You can also use BeautifulSoup, lxml, pyquery, and other libraries.
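For comparison, a sketch of the same extraction with BeautifulSoup (requires the beautifulsoup4 package):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div><span>good</span></div>', 'html.parser')
>>> soup.span.get_text()
'good'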
