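I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters: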
def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i) < 128)
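And this one replaces each non-ASCII character with as many spaces as there are bytes in the character's encoded form (i.e. the – character is replaced with 3 spaces):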
import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)
How can I replace all non-ASCII characters with a single space?
Of the myriad of similar SO questions (listed here), none address character replacement as opposed to stripping, and none additionally address all non-ASCII characters rather than one specific character.
https://stackoverflow.com/questions/1342000/how-to-replace-non-ascii-characters-in-string
https://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii
https://stackoverflow.com/questions/6609895/efficiently-replace-bad-characters
https://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python
https://stackoverflow.com/questions/15737048/handle-non-ascii-code-string-in-python
https://stackoverflow.com/questions/8689795/python-remove-non-ascii-characters-but-leave-periods-and-spaces
https://stackoverflow.com/questions/2921815/help-replacing-non-ascii-character-in-python
https://stackoverflow.com/questions/17273575/python-replace-non-ascii-characters-in-a-list-of-strings
https://stackoverflow.com/questions/16866261/detecting-non-ascii-characters-in-unicode-string
https://stackoverflow.com/questions/3667875/removing-non-ascii-characters-from-any-given-stringtype-in-python
https://stackoverflow.com/questions/19000968/what-is-the-correct-way-to-use-unicode-characters-in-a-python-regex
https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string
https://stackoverflow.com/questions/3586903/sqlite-remove-non-utf-8-characters
https://stackoverflow.com/questions/15321138/removing-unicode-u2026-like-characters-in-a-string-in-python2-7
https://stackoverflow.com/questions/18522127/removing-non-ascii-characters-in-a-csv-file
https://stackoverflow.com/questions/3870084/how-to-decode-a-non-unicode-character-in-python
Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:
return ''.join([i if ord(i) < 128 else ' ' for i in text])
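This handles characters one by one and still uses one space per replaced character, so a run of consecutive non-ASCII characters becomes an equally long run of spaces.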
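As a minimal sketch of the conditional-expression approach, the line above can be wrapped in a complete function. The name replace_non_ascii_per_char and the sample string are illustrative assumptions, not part of the question, and the sketch assumes Python 3, where iterating over a str yields code points:

```python
def replace_non_ascii_per_char(text):
    """Replace each non-ASCII character with one space (hypothetical helper name)."""
    # ord(ch) < 128 holds only for code points in the ASCII range \x00-\x7F.
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)


sample = "a\u2013\u2013b"  # 'a', two en dashes, 'b'
result = replace_non_ascii_per_char(sample)
print(repr(result))  # two spaces between 'a' and 'b', one per en dash
```

The output has the same length as the input, which is useful when character offsets must be preserved but not when a single space per run is wanted.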
Your regular expression should instead replace each run of consecutive non-ASCII characters with a single space:
re.sub(r'[^\x00-\x7F]+',' ', text)
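Note the + there: it makes the character class match one or more consecutive characters, so a whole run is replaced at once.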
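To illustrate what the + changes, the sketch below compares the pattern with and without it. The names collapse_non_ascii, NON_ASCII_RUN and sample are hypothetical, and the sketch assumes Python 3 str input. The final bytes example rests on the assumption that the question's "3 spaces" came from matching against a UTF-8 encoded byte string, where each byte of a multi-byte character matches separately:

```python
import re

# One or more consecutive non-ASCII characters form a single match.
NON_ASCII_RUN = re.compile(r'[^\x00-\x7F]+')


def collapse_non_ascii(text):
    """Replace each run of non-ASCII characters with one space (hypothetical name)."""
    return NON_ASCII_RUN.sub(' ', text)


sample = "a\u2013\u2013b"  # 'a', two en dashes, 'b'

without_plus = re.sub(r'[^\x00-\x7F]', ' ', sample)  # one space per character
with_plus = collapse_non_ascii(sample)               # one space per run

# Assumption: on UTF-8 encoded bytes, every byte of a multi-byte
# character is outside \x00-\x7F and therefore matches on its own.
encoded = "\u2013".encode("utf-8")
per_byte = re.sub(rb'[^\x00-\x7F]', b' ', encoded)

print(repr(without_plus), repr(with_plus), repr(per_byte))
```

With the +, the number of spaces no longer depends on how many characters (or bytes) the run contains, which is what the question asks for.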