When trying to set up a verbose regular expression:
import re

# set up variables
ankerwords = ['beerdigt','bestattet','begraben','beigesetzt']
# combine the words, five words before/after
rx = re.compile(r'''
(?:\b\w+\W+){5} # five words before
(?:{})
(?:\W+\w+\b){5} # five words thereafter
'''.format("|".join(ankerwords)), re.X)
This raises IndexError: tuple index out of range.
I know it's because of the {5} in the expression, but how can I get around it without splitting the string into several parts, i.e.
'''(?:\b\w+\W+){5}''' + '(?:{})'.format(...)
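This is really more a question of style.
You can double the braces so that format treats them as ordinary characters (doubling escapes them; see "How can I print literal curly-brace characters in python string and also use .format on it?" https://stackoverflow.com/questions/5466451/how-can-i-print-literal-curly-brace-characters-in-python-string-and-also-use-fo):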
rx = re.compile(r'''
(?:\b\w+\W+){{5}} # five words before
(?:{})
(?:\W+\w+\b){{5}} # five words thereafter
'''.format("|".join(ankerwords)), re.X)
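To make the cause concrete, here is a minimal sketch of what str.format does with the braces. It assumes only the ankerwords list from the question; the names failed, template and pattern are made up for illustration. A single-brace {5} is read as "positional argument number 5", which does not exist, while {{5}} is emitted as the literal text {5}.

```python
import re

ankerwords = ['beerdigt', 'bestattet', 'begraben', 'beigesetzt']

# str.format reads "{5}" as a replacement field for positional argument 5.
# Only one argument is supplied, so the lookup fails.
try:
    r'(?:\b\w+\W+){5}(?:{})'.format("|".join(ankerwords))
    failed = False
except IndexError:
    failed = True

# Doubled braces are escapes: "{{5}}" comes out as the literal "{5}",
# and only the single "{}" is filled in.
template = r'(?:\b\w+\W+){{5}}(?:{})(?:\W+\w+\b){{5}}'
pattern = template.format("|".join(ankerwords))

rx = re.compile(pattern)
print(failed)
print(pattern)
```

The resulting pattern is what the regex engine finally sees, so the quantifier {5} is intact by the time re.compile runs.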
Or use old-style % formatting:
rx = re.compile(r'''
(?:\b\w+\W+){5} # five words before
(?:%s)
(?:\W+\w+\b){5} # five words thereafter
''' % ("|".join(ankerwords)), re.X)
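One trade-off worth keeping in mind with the % style: braces no longer need escaping, but a literal percent sign in the pattern must then be doubled as %%. The sketch below uses a made-up pattern (an anchor word followed by a percentage), not the one from the question, purely to illustrate that.

```python
import re

ankerwords = ['beerdigt', 'bestattet', 'begraben', 'beigesetzt']

# With %-formatting, "{1,3}" is left untouched,
# but a literal "%" has to be written as "%%".
pattern = r'(?:%s)\W+\d{1,3}%%' % "|".join(ankerwords)

rx = re.compile(pattern)
match = rx.search("beigesetzt: 50% der Faelle")
print(pattern)
print(match.group(0))
```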
Another option in this case, since {5} is repeated, could be this:
rx = re.compile(r'''
(?:\b\w+\W+){five} # five words before
(?:{expr})
(?:\W+\w+\b){five} # five words thereafter
'''.format(expr="|".join(ankerwords), five="{5}"), re.X)
(This avoids doubling the braces and lets you "parameterize" the number of words once and for all.)
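Taking the parameterization one step further, the quantifier can be built from a number. This is only a sketch: build_context_regex and its parameters words and n are hypothetical names, and the re.escape call is an extra precaution that is not part of the original answer.

```python
import re

def build_context_regex(words, n=5):
    """Hypothetical helper: match one of `words` with n words of context on each side."""
    # Escape each word so regex metacharacters in it are taken literally.
    alternation = "|".join(re.escape(w) for w in words)
    # Build the quantifier text, e.g. "{5}", outside the template.
    quantifier = "{%d}" % n
    return re.compile(r'''
        (?:\b\w+\W+){count}   # n words before
        (?:{expr})            # one of the anchor words
        (?:\W+\w+\b){count}   # n words thereafter
        '''.format(expr=alternation, count=quantifier), re.X)

ankerwords = ['beerdigt', 'bestattet', 'begraben', 'beigesetzt']
rx = build_context_regex(ankerwords, n=2)
match = rx.search("Er wurde im Kreis der Familie beigesetzt und viele Freunde kamen")
print(match.group(0))
```

Because the quantifier is passed in as a named field, the template itself contains no literal braces that would need doubling.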