Scrapy passes parameters across components

After the framework starts crawling the target page start_url, I need to extract an identifier from the start_url string to use as the name of the MongoDB collection, and then store the items through a pipeline.
Outline of the flow:

spider -> pipeline

Related code in the pipeline:

import pymongo

class MongoPipeline(object):

    # collection_name = "Gsl6RoxfN"

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "items")
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

The question now is how to pass the variable collection_name from the spider to the pipeline.
Thank you for reading, and thanks in advance.

Python scrapy

Apr.16,2021

I think there are two ways:
First, define the collection_name you need as a global variable in the spider module, and then import it into the pipelines module.
Second, add a collection_name field to the item, and use item.pop('collection_name') in the pipeline to take it back out.

Using the method suggested by @ silk solves the problem. It covers the whole operation: read start_url from MongoDB, process start_url to generate the identifier, and then pass the identifier to the pipeline as the name of the collection. The specific solution is as follows.

In the spider:

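import re

import pymongo
from scrapy import Request

def start_requests(self):
    client = pymongo.MongoClient('localhost', 27017)
    db_name = 'Sina'
    db = client[db_name]
    collection_set01 = db['UrlsQueue']
    datas = list(collection_set01.find({}, {'_id': 0, 'url': 1, 'status': 1}))
    # Nine alphanumeric characters between a "/" and the following "?"
    pattern = r'(?<=/)([0-9a-zA-Z]{9})(?=\?)'
    start_url = collection_name = None
    for data in datas:
        if data.get('status') != 'pending':
            continue
        url = data.get('url')
        match = re.search(pattern, url)
        if match is None:
            continue  # no ID in this URL; try the next pending entry
        collection_name = match.group(0)
        start_url = 'https://weibo.cn/comment/' + collection_name + '?ckAll=1'
        # update_one replaces the deprecated Collection.update
        collection_set01.update_one({'url': url}, {'$set': {'status': 'proccessing'}})
        break
    client.close()
    if start_url is None:
        return  # no pending URL with a usable ID
    # cookie is expected to be defined elsewhere in the spider module
    yield Request(url=start_url, callback=self.parse, cookies=cookie,
                  meta={'collection_name': collection_name})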

start_requests() reads a start_url from the database, extracts the identifier from it, and sends the request with the identifier in the meta parameter.
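The extraction step can be tried on its own, without Scrapy or MongoDB. This is a minimal sketch that reuses the pattern from the spider code. The helper name extract_collection_name and the sample URLs are made up for illustration and are not part of the original solution.

```python
import re

# Same pattern as in the spider: nine alphanumeric characters that
# directly follow a "/" and directly precede a "?".
PATTERN = re.compile(r'(?<=/)([0-9a-zA-Z]{9})(?=\?)')


def extract_collection_name(url):
    """Return the nine-character ID found in url, or None if there is none."""
    match = PATTERN.search(url)
    return match.group(0) if match else None


if __name__ == '__main__':
    print(extract_collection_name('https://weibo.cn/comment/Gsl6RoxfN?ckAll=1'))
    print(extract_collection_name('https://weibo.cn/comment/short?ckAll=1'))
```

Returning None when nothing matches lets the caller skip a URL explicitly instead of hitting an unbound variable later.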

def parse(self, response):
    collection_name = response.meta['collection_name']
    ......
    for i in range(0, len(node)):
        item['collection_name'] = collection_name
        yield item

parse() reads the value back from response.meta while parsing the response data, and copies it into each item.

In the pipeline:

def close_spider(self, spider):
    # update_one replaces the deprecated Collection.update
    self.db['UrlsQueue'].update_one({'status': 'proccessing'}, {'$set': {'status': 'finished'}})
    self.client.close()

def process_item(self, item, spider):
    # Take the routing field out of the item before storing it
    self.collection_name = item.pop('collection_name')
    self.db[self.collection_name].insert_one(dict(item))
    return item

pop() removes the collection_name field from the item, so it is used only to pick the collection and is not stored with the rest of the data.
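The effect of pop() in process_item can be checked without a running MongoDB. In this rough sketch a defaultdict stands in for the pymongo database and a plain dict stands in for the Scrapy item. The class name RoutingPipeline is hypothetical. Both stand-ins are assumptions made only to keep the example self-contained.

```python
from collections import defaultdict


class RoutingPipeline:
    """Stores each item under the collection named by the item itself."""

    def __init__(self):
        # Stand-in for the pymongo database: collection name -> stored documents.
        self.db = defaultdict(list)

    def process_item(self, item, spider=None):
        # pop() returns the routing key and removes it, so the stored
        # document keeps only the scraped fields.
        collection_name = item.pop('collection_name')
        self.db[collection_name].append(dict(item))
        return item


if __name__ == '__main__':
    pipeline = RoutingPipeline()
    pipeline.process_item({'collection_name': 'Gsl6RoxfN', 'text': 'hello'})
    print(dict(pipeline.db))
```

Because the key is popped, any pipeline that runs after this one no longer sees collection_name on the item.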

Many thanks to @ Yu Bai for the help.

