The problem of jieba segmenting English keywords that contain spaces

1. When a keyword contains spaces or special symbols, jieba cannot recognize it as a single word.

2. I found a workaround on GitHub: modify the jieba source code, __init__.py (shared freely, use at your own risk).
Open the default dictionary (in the jieba root directory) or your custom dictionary, and replace every space that separates the word from its frequency and part-of-speech tag with @@
(@@ is chosen because it is unlikely to appear in ordinary keywords), as sketched below.
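For example, a line in the custom dictionary, which jieba normally expects as word, frequency and part-of-speech tag separated by spaces, would be written with @@ as the delimiter instead, so a phrase that itself contains a space stays in one piece (the frequency 3 and tag n here are just hypothetical values):

happy dog@@3@@n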

Then open __init__.py in the jieba root directory and make the following changes.


re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+-sharp&\._]+)", re.U)

re_han_default = re.compile("(.+)", re.U)


re_userdict = re.compile("^(.+?)( [0-9]+)?( [a-z]+)?$", re.U)

re_userdict = re.compile("^(.+?)(\u0040\u0040[0-9]+)?(\u0040\u0040[a-z]+)?$", re.U)


word, freq = line.split(" ")[:2]

word, freq = line.split("\u0040\u0040")[:2]

And also:

re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)

re_han_cut_all = re.compile("(.+)", re.U)

But this lets through a large number of emoji and unwanted symbols such as =, (, and ).
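One option I can think of (my own sketch, not part of the GitHub patch) is to keep the permissive "(.+)" pattern and then filter the segmentation result afterwards, dropping every token that contains no Chinese character, Latin letter, or digit; the ranges are taken from jieba's original re_han_default, and the names _keep and cut_and_clean are just placeholders:

import re
import jieba

# keep only tokens containing at least one CJK character, letter, or digit;
# emoji, symbols such as = and (), and bare whitespace are dropped
_keep = re.compile("[\u4E00-\u9FD5A-Za-z0-9]")

def cut_and_clean(text):
    return [tok for tok in jieba.cut(text) if _keep.search(tok)]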

3. Expected output

I just want jieba to recognize Chinese and English keywords in my custom dictionary that contain spaces or are joined with "-", and to remove other special characters such as emoji.

How should I modify it?

string = "my dog is a happy dog"
jieba.add_word("happy dog")

jieba.cut(my dog is a happy dog)

outputs: ["my","dog","is","a","happy","dog"]

: ["my","dog","is","a","happy dog"]

This seems to come down to the regular expressions, which I am not good at. I hope someone more experienced can tell me what to do.

Apr.09,2021

You only need to add a space to the corresponding regular expression, for example:

-re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
+re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._% ]+)", re.U)

Example:

In [1]: import re

In [2]: import jieba

In [3]: s = 'my dog is a happy dog'

In [4]: list(jieba.cut(s))
Out[4]: ['my', ' ', 'dog', ' ', 'is', ' ', 'a', ' ', 'happy', ' ', 'dog']

In [5]: jieba.re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._% ]+)", re.U)

In [6]: jieba.add_word('happy dog')

In [7]: list(jieba.cut(s))
Out[7]: ['my', ' ', 'dog', ' ', 'is', ' ', 'a', ' ', 'happy dog']
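If the standalone " " tokens in Out[7] are unwanted, one simple follow-up (my own addition, not part of the answer above) is to filter out whitespace-only tokens:

tokens = [t for t in jieba.cut(s) if t.strip()]
# based on Out[7] above, this should give ['my', 'dog', 'is', 'a', 'happy dog']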

For English, you can also consider nltk and try its MWETokenizer:

import nltk

tokenized_string = nltk.word_tokenize("my dog is a happy dog")
mwe = [('happy', 'dog')]  # the multi-word expression (phrase) to keep together
mwe_tokenizer = nltk.tokenize.MWETokenizer(mwe)
result = mwe_tokenizer.tokenize(tokenized_string)
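Note that MWETokenizer joins the matched words with an underscore by default, so result here should be ['my', 'dog', 'is', 'a', 'happy_dog']; if you want the phrase kept with a space instead, pass the separator argument:

mwe_tokenizer = nltk.tokenize.MWETokenizer(mwe, separator=' ')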

If there is another solution using jieba itself, please add it.
