I want to replace all single quotes in a string with double quotes, except where they occur in contractions such as "n't", "'ll", "'m", etc.
input="the stackoverflow don\'t said, \'hey what\'"
output="the stackoverflow don\'t said, \"hey what\""
Code 1 (by @antti-haapala, https://stackoverflow.com/users/918959/antti-haapala):
import re

def convert_regex(text):
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)
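There are 3 cases: the ' is neither preceded nor followed by an alphanumeric character; or it is not preceded by an alphanumeric character but is followed by one; or it is preceded by an alphanumeric character but not followed by one.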
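The three cases can be checked on small inputs. The sketch below (Python 3; all sample strings except the first are made up for illustration) also shows that the three alternatives together mean "not both preceded and followed by a word character", so a two-branch pattern gives the same results. Note that \w also matches the underscore, not only letters and digits.

```python
import re

# Pattern from Code 1: three explicit cases.
THREE_CASES = r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)"
# Shorter form: the quote is not preceded by a word character,
# or it is not followed by one.
TWO_CASES = r"(?<!\w)'|'(?!\w)"

samples = [
    "the stackoverflow don't said, 'hey what'",
    "a ' b",      # case 1: no word character on either side
    "say 'hi",    # case 2: word character only after the quote
    "hi' there",  # case 3: word character only before the quote
]

for s in samples:
    long_form = re.sub(THREE_CASES, '"', s)
    short_form = re.sub(TWO_CASES, '"', s)
    # Print the converted text and whether both patterns agree.
    print(long_form, "|", long_form == short_form)
```

The apostrophe in don't is untouched in both forms because it has a word character on each side.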
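Problem: this doesn't work for words that end with an apostrophe, i.e. most plural possessives, and it also doesn't work for informal contractions that start with an apostrophe.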
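To make the failure concrete, here is a small check with two sample sentences of my own (they are not from the answer): a plural possessive and an informal contraction with a leading apostrophe. In both, the apostrophe is turned into a double quote.

```python
import re

def convert_regex(text):
    # Same pattern as Code 1.
    return re.sub(r"(?<!\w)'(?!\w)|(?<!\w)'(?=\w)|(?<=\w)'(?!\w)", '"', text)

# Plural possessive: the apostrophe after "dogs" is treated as a closing quote.
print(convert_regex("the dogs' bowls are empty"))  # the dogs" bowls are empty

# Leading-apostrophe contraction: the apostrophe in 'em is treated as an opening quote.
print(convert_regex("go get 'em"))  # go get "em
```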
Code 2 (by @kevin, https://stackoverflow.com/users/953482/kevin):
def convert_text_func(s):
    c = "_"  # placeholder character; must NOT appear in the string
    assert c not in s
    # Map each protected word to a copy with its apostrophe masked.
    protected = {word: word.replace("'", c) for word in ["don't", "it'll", "I'm"]}
    for k, v in protected.items():
        s = s.replace(k, v)
    s = s.replace("'", '"')  # every remaining single quote becomes a double quote
    for k, v in protected.items():
        s = s.replace(v, k)  # restore the protected words
    return s
There are too many words to list them all; for example, how would I specify persons', etc.?
Please help.
Edit 1: I'm using @anubhava's excellent answer, but I've run into a problem: the method sometimes fails on text translated from another language.
Code:
text=re.sub(r"(?<!s)'(?!(?:t|ll|e?m|s|d|ve|re|clock)\b)", '"', text)
Problem:
In the text 'Kumbh melas', "melas" is a Hindi-to-English translation, not a plural possessive noun.
Input="Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
Output=Similar to the "Kumbh melas', celebrated by the banks of the holy rivers of India,
Expected Output=Similar to the "Kumbh melas", celebrated by the banks of the holy rivers of India,
I'm hoping to add a condition that fixes this somehow. Manual intervention is the last resort.
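One condition that might help, as a sketch rather than a verified fix: instead of deciding quote by quote, match an opening and a closing quote together. The pattern below is an assumption of mine, not part of @anubhava's answer: the opening ' must not follow a word character, and the closing ' must not be followed by one. It handles the two inputs in this question, but it would still close too early on a plural possessive inside a quotation (for example 'the dogs' bowls') and would misread a leading-apostrophe contraction such as 'em as an opening quote.

```python
import re

# Hypothetical pattern: opening quote not preceded by a word character,
# shortest possible body, closing quote not followed by a word character.
QUOTED_SPAN = re.compile(r"(?<!\w)'(.+?)'(?!\w)")

def convert_paired_quotes(text):
    # Replace only quotes that pair up; in-word apostrophes are left alone.
    return QUOTED_SPAN.sub(r'"\1"', text)

print(convert_paired_quotes("the stackoverflow don't said, 'hey what'"))
print(convert_paired_quotes(
    "Similar to the 'Kumbh melas', celebrated by the banks of the holy rivers of India,"
))
```

Because the closing quote is found by what follows it rather than by what precedes it, the trailing s in melas no longer matters.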
Edit 2: A naive and long-winded fix:
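import string
import enchant  # pyenchant

def replace_translations(text):
    d = enchant.Dict("en_US")
    words = tokenize_words(text)  # tokenize_words is my own helper (not shown here)
    punctuations = set(string.punctuation)
    for i, word in enumerate(words):
        print(i, word)  # debug output
        # A non-dictionary word followed by ' is assumed to end a quotation.
        # The i + 1 bound check avoids an IndexError on the last token.
        if i + 1 < len(words) and word not in punctuations and not d.check(word) and words[i + 1] == "'":
            text = text.replace(word + "'", word + '"')
    return text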
Are there edge cases I'm missing, or is there a better approach?
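One possible tightening of Edit 2, offered as a sketch under assumptions: rather than tokenizing and calling text.replace (which rewrites every occurrence of the same word-plus-apostrophe pair), a regex callback can look only at apostrophes that end a word. The is_known_word parameter is a hypothetical stand-in for a dictionary lookup such as d.check, and the tiny word set in the usage lines is made up for illustration.

```python
import re

# A word immediately followed by an apostrophe that does not continue the word.
WORD_THEN_APOSTROPHE = re.compile(r"(\w+)'(?!\w)")

def close_quotes_after_unknown_words(text, is_known_word):
    """Turn a trailing ' into " when the word before it fails the lookup."""
    def fix(match):
        word = match.group(1)
        # Known word: keep the apostrophe (it may be a possessive).
        # Unknown word: assume the apostrophe is a closing quote.
        return word + ("'" if is_known_word(word) else '"')
    return WORD_THEN_APOSTROPHE.sub(fix, text)

# Illustrative lookup only; a real run would use a spell-checking dictionary.
known = {"dogs"}
text = "Similar to the \"Kumbh melas', near the dogs' bowls"
print(close_quotes_after_unknown_words(text, lambda w: w.lower() in known))
```

This keeps the same heuristic as Edit 2, so a correctly spelled word ending in s at the end of a quotation would still be left with an apostrophe; the ambiguity is narrowed, not removed.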