Does the similarity between character vectors obtained by word2vec have little reference value?

In a word2vec model trained on Chinese characters, there does not seem to be much correlation between an input character and the most-similar output, and I don't know whether this is normal.
For example, the corpus is made up of texts about components. When I input "马" (horse), the top-10 results of model.most_similar do not contain "达" (as in "马达", motor), which surprised me a little. Is this normal?
Also, how large should the character embedding dimension be?
Asking the seniors here for advice, thank you.


Your situation is normal. By the principle of word2vec, it is not characters that frequently appear adjacent to each other that get highly similar vectors (such as "马" and "达" in the "马达"/motor example you cited), but characters that appear in very similar contexts, such as "疼" and "痛" (both meaning "pain"), or "苹果" (apple) and "梨" (pear). Because such characters or words often occur in similar contexts, their vectors end up highly similar.
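As a minimal sketch of how to check this yourself, assuming gensim 4.x (where queries go through model.wv) and a toy character-level corpus invented here for illustration:

```python
from gensim.models import Word2Vec

# Toy character-level corpus (hypothetical); each sentence is split into characters.
sentences = [
    list("这台马达的转速很高"),
    list("电机和马达的原理类似"),
    list("苹果和梨都是水果"),
]

# Skip-gram model; min_count=1 only because this toy corpus is tiny.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

# Characters that share similar contexts (not characters that sit next to each
# other, like "马" and "达") should rank highest here.
print(model.wv.most_similar("马", topn=10))
```

On a real corpus you would expect, say, "电机" and "马达" style neighbors (near-synonyms sharing contexts) rather than the adjacent character "达".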
In addition, for the embedding dimension: syntactic tasks (such as named entity recognition or word segmentation) usually work with a lower dimension (100–300), while semantic tasks (such as sentiment analysis) benefit from a slightly higher one (300–500).
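In gensim this is just the vector_size argument at training time (called size before gensim 4.0); a hedged sketch of the rule of thumb above:

```python
from gensim.models import Word2Vec

sentences = [list("这台马达的转速很高")]  # placeholder corpus, hypothetical

# Lower dimension for syntactic tasks (NER, word segmentation).
syntactic_model = Word2Vec(sentences, vector_size=100, min_count=1)

# Higher dimension for semantic tasks (sentiment analysis).
semantic_model = Word2Vec(sentences, vector_size=300, min_count=1)
```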
