Efficiency of massive data deletion in MongoDB

There are two collections:
data stores the data, and
attach stores the attachments.
data and attach have a one-to-many relationship.
The ratio of data to attach is about 1:10; that is, one data document may have 10-50 attach documents under it.
When data holds tens of millions of documents, attach may hold hundreds of millions.

The problem now is that when a user deletes data, the corresponding attach documents must be deleted as well.
Manipulating hundreds of millions of documents is a long process and puts a lot of pressure on the database.
So I am considering a soft delete: first update the status of the data documents to "deleted" and ignore attach for now, then have a background program clean up attach later; a sketch of the marking step is shown below.
After all, the wait for updating tens of millions of data documents is not long.
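A minimal sketch of that marking step, using the MongoDB Node.js driver in TypeScript; the connection string, database name and field names (uid, isDeleted, deletedAt) are assumptions for illustration:

```typescript
import { MongoClient } from "mongodb";

// Sketch only: connection string, database and field names (uid, isDeleted,
// deletedAt) are assumptions, not part of the original schema.
async function softDeleteUserData(userId: string): Promise<void> {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  try {
    const db = client.db("app");
    // Cheap soft delete: flip a flag on the data documents only;
    // the attach documents are left untouched for a background job.
    await db.collection("data").updateMany(
      { uid: userId },
      { $set: { isDeleted: true, deletedAt: new Date() } }
    );
  } finally {
    await client.close();
  }
}
```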
But there is a problem: attach is used for statistics. For example, before deleting, the user's attachments occupy 20 GB of space; after the deletion you have to report the remaining attachment size, otherwise billing is inaccurate.

So how can this problem be solved?

Aug.07,2021

This is a very common problem. In big-data scenarios, soft deletion is often used instead of physical deletion, so there is nothing wrong with the approach itself, and the first step (marking) must be done. If you know anything about GridFS, a record in GridFS's fs.files is essentially the equivalent of your data, while fs.chunks is the equivalent of your attach. fs.files holds a series of metadata such as file size, file name, path, and so on.
Now that you have marked the records as deleted, you can simply filter on the deletion flag when counting file sizes. For example, if you mark deletion with an isDeleted=true flag, the query you need is something like this:
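A sketch with the Node.js driver in TypeScript; the field names uid, isDeleted and size are assumptions, and if (following the GridFS analogy) the size metadata lives in data, the same pipeline simply runs against data instead of attach:

```typescript
import { MongoClient } from "mongodb";

// Sum the remaining attachment size for one user, skipping anything that
// has been soft-deleted. Field names (uid, isDeleted, size) are assumptions.
async function remainingAttachSize(userId: string): Promise<number> {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  try {
    const db = client.db("app");
    const rows = await db
      .collection("attach")
      .aggregate<{ total: number }>([
        { $match: { uid: userId, isDeleted: { $ne: true } } },
        { $group: { _id: null, total: { $sum: "$size" } } },
      ])
      .toArray();
    return rows[0]?.total ?? 0;
  } finally {
    await client.close();
  }
}
```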

The solution I can think of now is to calculate the total attach size corresponding to each data document and store it in data; after all, the size is fixed.
The attachment total is then computed from data, which is much faster over tens of millions of documents than over hundreds of millions of attach documents, and whether the attach records have been physically deleted or not does not affect the statistics; a sketch follows below.
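A sketch of how that precomputed size could be maintained and queried, again with the Node.js driver; attachSize is an assumed field on data holding the total size of its attachments:

```typescript
import { Db, ObjectId } from "mongodb";

// Keep the precomputed total on the parent data document up to date whenever
// an attachment is added, so statistics never have to scan attach.
// attachSize is an assumed field, not part of the original schema.
async function addAttachment(db: Db, dataId: ObjectId, attachmentBytes: number): Promise<void> {
  await db.collection("data").updateOne(
    { _id: dataId },
    { $inc: { attachSize: attachmentBytes } }
  );
}

// The billing statistic is then a single aggregation over data (tens of
// millions of documents) instead of attach (hundreds of millions).
async function totalAttachSize(db: Db, userId: string): Promise<number> {
  const rows = await db
    .collection("data")
    .aggregate<{ total: number }>([
      { $match: { uid: userId, isDeleted: { $ne: true } } },
      { $group: { _id: null, total: { $sum: "$attachSize" } } },
    ])
    .toArray();
  return rows[0]?.total ?? 0;
}
```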


I can think of another plan. data must use a uid to identify which user it belongs to, so every time the user performs a clear-all operation I generate a new uid for him, and his old data is completely detached from him. Of course, this uid is not the primary key, just a temporary identification id that is randomly generated each time. I also record every delete operation, keeping the old uid, and a background job then cleans up the data via the old uid; a sketch follows below.
This way every clear operation responds quickly and the user does not have to wait. However, the whole set of database relationships needs to be re-combed, which is indeed a lot of work.
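A sketch of that flow, with an assumed users collection and delete_log collection (all collection and field names here are illustrative):

```typescript
import { Db } from "mongodb";
import { randomUUID } from "node:crypto";

// "Clear all" for a user: rotate the uid so his existing data/attach records
// no longer match any query, and log the old uid for later physical cleanup.
// users, delete_log and the field names are assumptions for illustration.
async function clearUserData(db: Db, userId: string, oldUid: string): Promise<string> {
  const newUid = randomUUID();

  // 1. Point the user at a fresh uid; the clear operation returns immediately.
  await db.collection("users").updateOne({ userId }, { $set: { uid: newUid } });

  // 2. Remember the old uid so a background job can delete in bulk later.
  await db.collection("delete_log").insertOne({
    oldUid,
    requestedAt: new Date(),
    done: false,
  });

  return newUid;
}

// Background job: physically remove everything that belonged to one old uid.
async function cleanupOldUid(db: Db, oldUid: string): Promise<void> {
  await db.collection("attach").deleteMany({ uid: oldUid });
  await db.collection("data").deleteMany({ uid: oldUid });
  await db.collection("delete_log").updateOne(
    { oldUid },
    { $set: { done: true, finishedAt: new Date() } }
  );
}
```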
