How does Python determine whether a string contains garbled text?

How can I detect garbled Chinese text (mojibake) in Python? A garbled string looks similar to this (sample from the original post, further mangled by machine translation):
is it true that you are indolent and indolent? Do you know how to cut the raccoon, you know, how to prepare the pickaxe, the hydrogen, the hydrogen. The gallium regulation chain is surprised that the whole world is full of beauty. Forged Betula platyphylla 100%, no, no, no. Is there a trickle of embarrassment in the government?. 6 in the middle of nowhere, there is a great chain of Ying gentry and gentry. What"s wrong with you? Do you know what to do in the village? / p > in the village, the gallium constitution, the bullets, the bullets. In the village of tweezers, the chain is full of horrors, meat, meat and blood.

Mar.01,2021

Decode the string to Unicode, then try to encode it to GB2312. If an error is raised, the string contains rare characters, which usually indicates mojibake. For more information, see https://jingsam.github.io/201
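A minimal sketch of this check (the function name `has_rare_chars` is mine, not from the answer): in Python 3 a `str` is already Unicode, so it suffices to attempt the GB2312 encoding and catch the failure.

```python
def has_rare_chars(s: str) -> bool:
    """Return True if s contains characters outside GB2312 -- often
    (but not always) a sign of mojibake; rare legitimate characters
    will also fail this check."""
    try:
        s.encode("gb2312")  # Python 3 str is Unicode; just try the encoding
        return False
    except UnicodeEncodeError:
        return True

print(has_rare_chars("今天天气很好"))  # common Chinese, all inside GB2312
print(has_rare_chars("龘靐齉"))        # very rare characters outside GB2312
```

Note the caveat from the word-segmentation answer below applies here too: rare but legitimate characters are flagged as well.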


Word segmentation also works well: garbled text is unlikely to form real words (though obscure but legitimate words may be mis-flagged as garbled).
See my code below:

# encoding=utf-8
import jieba


def new_len(iterable):
    """Length of any iterable, including generators such as jieba.cut()."""
    try:
        return iterable.__len__()
    except AttributeError:
        return sum(1 for _ in iterable)


# Placeholder normal Chinese sentence; the original sample was lost
# when the post was extracted.
normal_str = "今天天气很好，我们一起去公园散步吧。"
normal_len = len(normal_str)
seg_list = jieba.cut(normal_str)

res = "normal: " + str(normal_len / new_len(seg_list))
print(res)

# The garbled sample string from the post, kept as-is.
luanma_str = "???100%?5 10??.6???/p> ?"
luanma_len = len(luanma_str)
luanma = jieba.cut(luanma_str)

res = "garbled: " + str(luanma_len / new_len(luanma))
print(res)

Output (the author's results for the original sample):

normal: 2.25
garbled: 1.0590062111801242

A normal string generally scores above 2, while garbled text scores very close to 1; a ratio below about 1.2 can safely be treated as garbled.
This can also be turned into a probability formula.
Requiring a probability of 0.9 at a ratio of 1 and 0.1 at a ratio of 2 yields the formula
$$P = \frac{1}{1 + \exp(4.395x - 6.594)}$$

where:
x is the ratio of the string length to the number of segmented words;
P is the probability that the string is garbled.
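The logistic formula above can be sketched directly (the function name `garbled_probability` is mine, not from the answer):

```python
import math

def garbled_probability(x: float) -> float:
    """Probability that a string is garbled, given x = the ratio of
    string length to the number of jieba-segmented words."""
    return 1.0 / (1.0 + math.exp(4.395 * x - 6.594))

print(round(garbled_probability(1.0), 2))  # ratio typical of garbled text -> ~0.9
print(round(garbled_probability(2.0), 2))  # ratio typical of normal text  -> ~0.1
```

The two coefficients (4.395 and 6.594) are exactly those that place the curve at P = 0.9 when x = 1 and P = 0.1 when x = 2, matching the calibration described above.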

This method uses the jieba ("stutter") word-segmentation module, which needs to be installed in advance:

pip3 install jieba

See https://github.com/fxsjy/jieba


jieba parsing is slow
