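I have two Unicode characters that mean the same thing. The compat character is the compatibility form of the origin character, so it makes sense that the two should have the same value, but when I compare them for equality the result is False instead.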
import unicodedata

origin = 'ᅢ'  # Korean letter for: AE
compat = 'ㅐ'  # Korean letter for: AE
print('origin', ascii(origin))
print('compat', ascii(compat), '\n')
decompose_origin = unicodedata.decomposition(origin)
decompose_compat = unicodedata.decomposition(compat)
print('decompose: origin', decompose_origin)
print('decompose: compat', decompose_compat, '\n')
# expected output: True
print(decompose_origin == decompose_compat)
Output:

origin '\u1162'
compat '\u3150'

decompose: origin
decompose: compat <compat> 1162

False
Normalizing the strings to the NFKC or NFKD normal form (https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize) makes them comparable:
from unicodedata import normalize
origin = '\u1162'
compat = '\u3150'
for normal_form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    print(normal_form, ascii(normalize(normal_form, origin + ' == ' + compat)))
    print(normalize(normal_form, origin) == normalize(normal_form, compat))
# NFC '\u1162 == \u3150'
# False
# NFD '\u1162 == \u3150'
# False
# NFKC '\u1162 == \u1162'
# True
# NFKD '\u1162 == \u1162'
# True
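Both NFKC and NFKD apply "the compatibility decomposition, i.e. replace all compatibility characters with their equivalents". The NFKC form then also applies the canonical composition.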
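For reuse, the comparison can be wrapped in a small helper. The sketch below is only one way to do it: the name compat_equal is hypothetical, and choosing NFKC over NFKD is an assumption (either form works for an equality check). It also shows why comparing unicodedata.decomposition() results does not work: that function returns the raw decomposition mapping as a string, which is empty for U+1162 because that character has no mapping of its own.

```python
import unicodedata


def compat_equal(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b are equal after NFKC normalization."""
    return unicodedata.normalize('NFKC', a) == unicodedata.normalize('NFKC', b)


origin = '\u1162'  # HANGUL JUNGSEONG AE
compat = '\u3150'  # HANGUL LETTER AE

# decomposition() returns the mapping string, not a normalized character:
# empty for origin (no mapping), '<compat> 1162' for compat.
print(repr(unicodedata.decomposition(origin)))  # ''
print(repr(unicodedata.decomposition(compat)))  # '<compat> 1162'

print(origin == compat)              # False: different code points
print(compat_equal(origin, compat))  # True: both normalize to U+1162
```

Keep in mind that compatibility normalization is lossy (for example, it also folds ligatures and full-width forms), so it is usually better to normalize only for the comparison and keep the original strings for storage or display.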