Deduplication in MongoDB

I have written a crawler. In the insert method of my Mongo class, I want to add a check so that data that has already been crawled is not stored again, but in practice it does not work. Please take a look at the code first.

class MongoPipeLine():

    def __init__(self):
        self.client = pymongo.MongoClient(SETTINGS.MONGO_URI,connect=False)
        self.db = self.client[SETTINGS.MONGO_DB]
        self.collection = SETTINGS.COLLECTION

    def insert(self,data):
        if self.db[self.collection].find(data) == 1:
            print("Data has been existed")
        else:
            self.db[self.collection].insert(data)

    def close(self):
        self.client.close()

Scheduling function:

spider = Spider()
mongo = MongoPipeLine()
image = ImagePipeLine()

def run(i):
    for page in range(i,i+20):
        response = spider.get_page(page)
        data_list = spider.parse(response)
        for data in data_list:
            mongo.insert(data)
            image.download(data)

if __name__ == "__main__":
    pool = Pool(20)
    pool.map(run,[i*20 for i in range(10)])
    pool.close()
    pool.join()
    mongo.close()

Partial output from a run:

Isaiah Rustad Already download
Annie Spratt Already download
Alex Kalinin Already download
Jakob Owens Already download
Emily Henry Already download
  1. As you can see from the result, although some images have already been downloaded and the corresponding entries are saved in MongoDB, the console output never shows "Data has been existed". Evidently the branch under the if statement is never executed.
  2. Strangely, after I change the condition to "if self.db[self.collection].find(data) == 0", the output is the same as before, and there is still no "Data has been existed".
  3. Are there any basic mistakes that I haven't considered?

Thanks in advance, and happy National Day to everyone!

Jul.23,2021

It is strange to compare the result of find with 0 or 1. Shouldn't you use countDocuments (count_documents in pymongo) to get the number of matches? Print the result of the query and see what it actually is.
Also, what query criteria do you use to identify duplicates? Check whether there is something wrong with your query conditions.
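
To make the two points above concrete, here is a minimal sketch of a counting check. It assumes `collection` is a pymongo Collection (for example `self.db[self.collection]` from the question) and that duplicates can be identified by a small set of fields; `insert_if_absent` and `key_fields` are hypothetical names, not part of the original code.

```python
def insert_if_absent(collection, data, key_fields):
    """Insert `data` only if no stored document has the same key fields.

    `collection` is assumed to be a pymongo Collection; `key_fields` is a
    hypothetical list of field names that identify a duplicate.
    """
    # Build the filter from the identifying fields only, not the whole document.
    key = {name: data[name] for name in key_fields}

    # find() returns a Cursor object, never an int, so comparing it with
    # 0 or 1 is always False. count_documents() returns an int, and
    # limit=1 lets the server stop at the first match.
    if collection.count_documents(key, limit=1) > 0:
        print("Data already exists")
        return False

    # insert_one() is the current pymongo call for a single document.
    collection.insert_one(data)
    return True
```

Inside the question's class, the call might look like `insert_if_absent(self.db[self.collection], data, ["id"])`, where `"id"` stands in for whatever field really identifies a record in the crawled data.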
Are you running concurrently? Have you considered the data inconsistency that concurrency can cause? For example, one thread inserts record A while another thread checks whether A exists, but the first thread's insert has not yet completed.
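
One common way to narrow that gap is to let the server do the check and the insert in a single operation, backed by a unique index. The sketch below is only an illustration under assumptions: `collection` is taken to be a pymongo Collection, and `ensure_unique_key`, `upsert_if_absent`, and `key_fields` are hypothetical names that do not appear in the original code.

```python
def ensure_unique_key(collection, key_fields):
    """Create a unique index so the server rejects a second document with the same key."""
    # 1 means ascending order, the same value as pymongo.ASCENDING.
    collection.create_index([(name, 1) for name in key_fields], unique=True)


def upsert_if_absent(collection, data, key_fields):
    """Insert `data` unless a document with the same key fields already exists.

    Returns True when a new document was written, False when one was
    already stored.
    """
    key = {name: data[name] for name in key_fields}

    # With upsert=True the server inserts a document when nothing matches
    # the filter. $setOnInsert writes the fields only in that insert case,
    # so an existing document is left unchanged.
    result = collection.update_one(key, {"$setOnInsert": data}, upsert=True)

    # upserted_id is None when the filter matched an existing document.
    inserted = result.upserted_id is not None
    if not inserted:
        print("Data already exists")
    return inserted
```

The upsert alone does not rule out duplicates when several workers write the same key at the same moment; it is the unique index that enforces that, so `ensure_unique_key` would be called once before the pool starts. With the index in place, a losing concurrent write may surface as a duplicate-key error that the caller has to handle.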
