The problem of jieba segmenting English keywords that contain spaces

1. When a keyword contains spaces or special symbols, jieba cannot recognize it as a single word.

2. I found a workaround on GitHub: modify the jieba source code, __init__.py (shared freely, use at your own risk).
Open the default dictionary (in the jieba root directory) or your custom dictionary, and replace every space that separates the word from its frequency and part-of-speech tag with @@
(@@ is chosen because it is unlikely to appear in ordinary keywords), as sketched below.
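For example, a line in the custom dictionary, which jieba normally expects as word, frequency and part-of-speech tag separated by spaces, would be written with @@ as the delimiter instead, so a phrase that itself contains a space stays in one piece (the frequency 3 and tag n here are just hypothetical values):

happy dog@@3@@n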

Then open __init__.py in the jieba root directory and make the following changes.


re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+-sharp&\._]+)", re.U)

re_han_default = re.compile("(.+)", re.U)


re_userdict = re.compile("^(.+?)( [0-9]+)?( [a-z]+)?$", re.U)

re_userdict = re.compile("^(.+?)(\u0040\u0040[0-9]+)?(\u0040\u0040[a-z]+)?$", re.U)


word, freq = line.split(" ")[:2]

word, freq = line.split("\u0040\u0040")[:2]

And also:

re_han_cut_all = re.compile("([\u4E00-\u9FD5]+)", re.U)

re_han_cut_all = re.compile("(.+)", re.U)

But this lets through a large number of emoji and unwanted symbols such as =, (, and ).
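One option I can think of (my own sketch, not part of the GitHub patch) is to keep the permissive "(.+)" pattern and then filter the segmentation result afterwards, dropping every token that contains no Chinese character, Latin letter, or digit; the ranges are taken from jieba's original re_han_default, and the names _keep and cut_and_clean are just placeholders:

import re
import jieba

# keep only tokens containing at least one CJK character, letter, or digit;
# emoji, symbols such as = and (), and bare whitespace are dropped
_keep = re.compile("[\u4E00-\u9FD5A-Za-z0-9]")

def cut_and_clean(text):
    return [tok for tok in jieba.cut(text) if _keep.search(tok)]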

3. Expected output

I just want jieba to recognize Chinese and English keywords in my custom dictionary that contain spaces or are joined with "-", and to remove other special characters such as emoji.

How should I modify it?

string = "my dog is a happy dog"
jieba.add_word("happy dog")

jieba.cut(my dog is a happy dog)

outputs: ["my","dog","is","a","happy","dog"]

: ["my","dog","is","a","happy dog"]

This seems to come down to the regular expressions, which I am not good at. I hope someone more experienced can tell me what to do.

Apr.09,2021

You only need to add a space to the corresponding regular expression, for example:

-re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._%]+)", re.U)
+re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._% ]+)", re.U)

Example:

In [1]: import re

In [2]: import jieba

In [3]: s = 'my dog is a happy dog'

In [4]: list(jieba.cut(s))
Out[4]: ['my', ' ', 'dog', ' ', 'is', ' ', 'a', ' ', 'happy', ' ', 'dog']

In [5]: jieba.re_han_default = re.compile("([\u4E00-\u9FD5a-zA-Z0-9+#&\._% ]+)", re.U)

In [6]: jieba.add_word('happy dog')

In [7]: list(jieba.cut(s))
Out[7]: ['my', ' ', 'dog', ' ', 'is', ' ', 'a', ' ', 'happy dog']
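If the standalone " " tokens in Out[7] are unwanted, one simple follow-up (my own addition, not part of the answer above) is to filter out whitespace-only tokens:

tokens = [t for t in jieba.cut(s) if t.strip()]
# based on Out[7] above, this should give ['my', 'dog', 'is', 'a', 'happy dog']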

For English, you can also consider nltk and try its MWETokenizer:

import nltk

tokenized_string = nltk.word_tokenize("my dog is a happy dog")
mwe = [('happy', 'dog')]  # the multi-word expression (phrase) to keep together
mwe_tokenizer = nltk.tokenize.MWETokenizer(mwe)
result = mwe_tokenizer.tokenize(tokenized_string)
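Note that MWETokenizer joins the matched words with an underscore by default, so result here should be ['my', 'dog', 'is', 'a', 'happy_dog']; if you want the phrase kept with a space instead, pass the separator argument:

mwe_tokenizer = nltk.tokenize.MWETokenizer(mwe, separator=' ')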

If there is another solution using jieba itself, please add it.
