A difficult MongoDB database design problem

I honestly don't know what title would best describe it.

I have two collections

data collection: stores article-level information, such as status, category, time, etc.
field collection: stores the individual fields of each article, because an article's fields are not fixed; for example title, content, etc.
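For context, a document in the data collection might look roughly like this (a sketch only; field names such as uid, status, category and created_at are my assumptions, not taken from the actual schema):

# Hypothetical shape of a document in the data collection.
# uid, status, category and created_at are assumed names for illustration.
from datetime import datetime
from bson import ObjectId

data_doc = {
    "_id": ObjectId("5c1beb6c559cd15f2d57cbc8"),
    "uid": 100,                      # owner of the article
    "status": 1,                     # 1 = normal, -1 = soft-deleted
    "category": "news",
    "created_at": datetime(2018, 12, 20),
}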

A document in the field collection looks like this:

{
    field_key: "title",
    data_id: ObjectId("5c1beb6c559cd15f2d57cbc8"),
    data: "",
    data_hash: "a46c23269ab827c5f878e766984e4716" // hash of the field's data
}

The current problem is that I need to detect whether a given field value already exists.

For example, to check whether this title already exists, I use a query like this:

{
    field_key: "title",
    data_hash: "a46c23269ab827c5f878e766984e4716",
}

Looks fine, doesn't it? It seems like a simple problem. But it's not.

When the amount of data is large, say tens of millions of documents in the data collection, the field collection may reach hundreds of millions of documents.

If a user (say uid=100) deletes their articles, the corresponding documents in both the field collection and the data collection should go away.
To keep the user from waiting, I only soft-delete in the data collection; for example, status=-1 marks a document as deleted.
After all, deleting or updating a field collection with hundreds of millions of documents is a time-consuming operation, so the deletion mechanism is designed this way.
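A minimal pymongo sketch of this soft-delete path, assuming the collections are called data and field and the data documents carry a uid (all assumed names):

# Soft delete: only the data collection is touched; the huge field
# collection is deliberately left alone to keep the operation fast.
from pymongo import MongoClient

db = MongoClient()["mydb"]           # hypothetical database name

def delete_user_articles(uid: int) -> None:
    db.data.update_many({"uid": uid}, {"$set": {"status": -1}})
    # Note: the matching documents in db.field are NOT updated here.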

So here comes the problem. The query above may match multiple field documents. At that point, how do I know whether the article a matched field belongs to has been deleted?

So every time, for each field document the query returns, I also have to check its article's status in the data collection, which puts a lot of pressure on the database.
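To make the cost concrete, this is roughly what the duplicate check has to do today (a sketch with the same assumed names):

from pymongo import MongoClient

db = MongoClient()["mydb"]           # hypothetical database name

def title_exists(data_hash: str) -> bool:
    # Every candidate from the field collection must be verified against
    # the data collection, because its article may be soft-deleted.
    for f in db.field.find({"field_key": "title", "data_hash": data_hash}):
        parent = db.data.find_one({"_id": f["data_id"]}, {"status": 1})
        if parent and parent.get("status") != -1:
            return True              # a live article already uses this value
    return False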

My candidate solutions:

First scenario: store a value for each field in Redis. For example, put md5(field_key + data_hash) into a set; even with hundreds of millions of entries I estimate it would not take up much space, and query performance would certainly improve a lot.
But a new problem arises: if a user deletes, say, 1000 articles, I have to query the database for every affected field and then remove those entries from the set. That hurts deletion performance a little, but on balance it still seems like a reasonable trade-off.
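A sketch of this first scenario with redis-py and hashlib (the set member format md5(field_key + data_hash) follows the description above; the key name field_hashes and everything else are assumptions):

import hashlib

import redis
from pymongo import MongoClient

r = redis.Redis()
db = MongoClient()["mydb"]
SET_KEY = "field_hashes"             # hypothetical Redis key

def member(field_key: str, data_hash: str) -> str:
    return hashlib.md5((field_key + data_hash).encode()).hexdigest()

def on_field_added(field_key: str, data_hash: str) -> None:
    r.sadd(SET_KEY, member(field_key, data_hash))

def title_exists(data_hash: str) -> bool:
    # O(1) membership test in Redis, no round trip to the data collection.
    return bool(r.sismember(SET_KEY, member("title", data_hash)))

def on_articles_deleted(deleted_ids: list) -> None:
    # The painful part: find every affected field and evict it from the set.
    for f in db.field.find({"data_id": {"$in": deleted_ids}},
                           {"field_key": 1, "data_hash": 1}):
        r.srem(SET_KEY, member(f["field_key"], f["data_hash"]))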

Second solution: change the deletion mechanism and also batch-update the affected field documents to status=-1. But I have not tested this: with a field collection of hundreds of millions of documents, how long would updating, say, 1 million of them in batches take, and would it block the database and affect the business? So this is risky.
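A sketch of this second solution, batching the update so it never runs as one giant statement (the batch size is arbitrary; how this behaves at the hundreds-of-millions scale would still need testing):

from pymongo import MongoClient

db = MongoClient()["mydb"]
BATCH = 10_000                       # arbitrary batch size

def soft_delete_fields(deleted_ids: list) -> None:
    # Mark affected field documents in chunks to limit the work each
    # update_many does at once; still potentially slow at this scale.
    for i in range(0, len(deleted_ids), BATCH):
        chunk = deleted_ids[i:i + BATCH]
        db.field.update_many({"data_id": {"$in": chunk}},
                             {"$set": {"status": -1}})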

I wonder if there are any other simple and effective solutions?

-supplement-

Some answers ask why the data is split into two collections. There is a reason for this design:

1. I didn't make myself clear: "article" is just a simplification of the problem. (If I wrote tens of thousands of words, I guess you wouldn't bother to read the question.)

In fact, the data can be articles, product data, or Weibo posts; the data models we need to store are diverse.

For example, a stored Weibo post may have hundreds of thousands of comments, and each comment carries its own time, user information, and so on.
As long as the 16MB document size limit exists, embedding all of that would constrain my system, so I can't design it that way.

2. An article's content can have multiple versions: version 1, version 2, version 3, and so on, with no fixed upper bound. So 16MB would definitely not be enough.

There are various other reasons that I won't go into in detail, but one thing is certain: the two collections must stay separate.

I ended up using Redis for this. It seems to be a reasonably good plan; apart from it, I can't think of a better one.

3. As for the question of denormalization:

Someone may ask: isn't it very inconvenient to query all of an article's data when it is stored this way?

We keep another copy of the data in Elasticsearch. Elasticsearch's biggest characteristic is data redundancy; it does not follow normal forms.

As described in the answer below, this is how we store it in Elasticsearch:

{
    // article-level data (status, category, time, etc.)
    fields: [
        {
            field_key: "title",
            data_id: ObjectId("5c1beb6c559cd15f2d57cbc8"),
            data: "",
        },
        {
            field_key: "content",
            data_id: ObjectId("5c1beb6c559cd15f2d57cbc8"),
            data: "",
        },
        ...
    ]
}

You only need one query and everything comes out, so that is not a problem. That is also why we follow normal forms as much as possible in MongoDB.
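For illustration, a single lookup against that denormalized document might look like this with the Elasticsearch 8.x Python client (the index name articles and a top-level data_id field are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")      # hypothetical cluster address

# One query by article id returns the whole denormalized document,
# fields array included; no join back to other collections is needed.
resp = es.search(index="articles",
                 query={"term": {"data_id": "5c1beb6c559cd15f2d57cbc8"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["fields"])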

-supplement-

After using Redis for a while, we observed that the set currently holds about 10 million entries and takes roughly 1GB of storage.
That is, at 100 million entries it would be an estimated 10GB, which is bound to become a performance bottleneck. So this is not a good idea either.

-supplement-

I suddenly realize how stupid I have been, damn it. Why not store the contents of that Redis set in MongoDB instead?
Just use Redis as a cache. Then neither performance nor storage is a problem.
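A sketch of that final idea: keep the authoritative set in a dedicated MongoDB collection (called field_hash here, with a unique index on the hash) and use Redis only as a look-aside cache. All names are assumptions:

import hashlib

import redis
from pymongo import MongoClient

r = redis.Redis()
db = MongoClient()["mydb"]
db.field_hash.create_index("h", unique=True)     # authoritative membership store

def member(field_key: str, data_hash: str) -> str:
    return hashlib.md5((field_key + data_hash).encode()).hexdigest()

def exists(field_key: str, data_hash: str) -> bool:
    h = member(field_key, data_hash)
    if r.sismember("field_hash_cache", h):        # fast path: Redis cache hit
        return True
    if db.field_hash.find_one({"h": h}) is not None:
        r.sadd("field_hash_cache", h)             # warm the cache
        return True
    return False

On deletion the hash would be removed from both MongoDB and the Redis set; negative lookups always fall through to MongoDB in this sketch.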

Mar.17,2022

It still looks like you are designing the schema according to relational-database normal forms. Why not merge the two collections into one?
The fields an article can have are limited, and 16MB is enough space for the content plus the fields, so the simplest way:
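For example, something like this (a minimal sketch of the suggestion, with assumed field names):

from datetime import datetime
from pymongo import MongoClient

db = MongoClient()["mydb"]

# One article = one document; fields live inside it, so duplicate checks,
# reads and deletes all touch a single collection.
db.article.insert_one({
    "uid": 100,
    "status": 1,
    "category": "news",
    "created_at": datetime.now(),
    "fields": [
        {"field_key": "title",   "data": "", "data_hash": "a46c23269ab827c5f878e766984e4716"},
        {"field_key": "content", "data": ""},
    ],
})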

  

Same as above: I am also curious why this is split into two collections. MongoDB is very good at handling this kind of standard document data.
