How to encode the words in a vocabulary efficiently?

I have two approaches so far. One is to map each word directly to a number:
// let weight=
//     {
//         "": 10,
//         "": 5,
//         "": 7,
//         "": 4,
//         "": 7,
//         "ufo": 3,
//     }

The other is to convert each character to the binary form of its Unicode code point:

let str=""

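// Concatenate the binary representation of every character's code point.
function hash(str)
{
    // Start from an empty string so the result is a bit string throughout
    // (starting from the number 0 silently prepends a "0" character).
    let strcode = ""
    for (const ch of str)
    {
        strcode += ch.codePointAt(0).toString(2)
    }
    return strcode
}

console.log(hash(str))
// e.g. 101011011111101 for the single code point U+56FD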

But neither encoding reduces the amount of data.
I need this encoding so that I can calculate text similarity later. Thank you.

Jul.07,2022

To vectorize the text before calculating similarity, you can also use models such as TF-IDF or LSI.
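
As a rough illustration of the TF-IDF route, here is a minimal sketch in plain JavaScript. It assumes the documents are already tokenized into arrays of words; the helper names `tfidfVectors` and `cosine` and the sample documents are made up for this example, and the weighting used (term frequency times `ln(N / df)`) is only one of several common TF-IDF variants.

```javascript
// Hypothetical helpers for illustration; docs are assumed to be arrays of words.
function tfidfVectors(docs) {
    // Give every distinct word a column index (the vocabulary).
    const vocab = new Map()
    for (const doc of docs) {
        for (const word of doc) {
            if (!vocab.has(word)) vocab.set(word, vocab.size)
        }
    }
    // Document frequency: in how many documents each word occurs.
    const df = new Array(vocab.size).fill(0)
    for (const doc of docs) {
        for (const word of new Set(doc)) df[vocab.get(word)] += 1
    }
    // One vector per document: relative term frequency times ln(N / df).
    return docs.map(doc => {
        const vec = new Array(vocab.size).fill(0)
        for (const word of doc) vec[vocab.get(word)] += 1 / doc.length
        return vec.map((tf, i) => tf * Math.log(docs.length / df[i]))
    })
}

// Cosine similarity of two equal-length vectors (0 if either is all zeros).
function cosine(a, b) {
    let dot = 0, na = 0, nb = 0
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb)
}

// Made-up sample data.
const docs = [
    ["ufo", "sighting", "report"],
    ["ufo", "sighting", "photo"],
    ["weather", "forecast", "rain"],
]
const vectors = tfidfVectors(docs)
console.log(cosine(vectors[0], vectors[1])) // shares two words: between 0 and 1
console.log(cosine(vectors[0], vectors[2])) // shares no words: 0
```

Each word only needs an integer column index here, so the vocabulary itself stays small; the similarity comes from the weighted vectors, not from how the individual words are encoded.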


Encoding cannot reduce the amount of data; compression can.
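
To make the difference concrete, here is a small sketch that assumes a Node.js environment and uses its built-in `zlib` module. The sample text is made up and deliberately repetitive, so the size reduction it shows is much larger than you would see on real text.

```javascript
const zlib = require("zlib")

// Made-up, highly repetitive sample text.
const text = "ufo ".repeat(1000)

const raw = Buffer.from(text, "utf8")      // plain UTF-8 encoding
const packed = zlib.gzipSync(raw)          // gzip-compressed bytes
const restored = zlib.gunzipSync(packed).toString("utf8")

console.log(raw.length)        // 4000 bytes before compression
console.log(packed.length)     // far fewer bytes after compression
console.log(restored === text) // true: compression is lossless
```

Note that compressed bytes are only useful for storage or transfer; similarity still has to be computed on the decompressed text or on vectors derived from it.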
