I recently timed a bunch of regexes for the question "A Regex that will never be matched by anything" (https://stackoverflow.com/questions/1723182/a-regex-that-will-never-be-matched-by-anything). My answer is here: https://stackoverflow.com/questions/1723182/a-regex-that-will-never-be-matched-by-anything/47115569#47115569 (see there for more information).
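However, after my testing I noticed that the regexes 'a^' and 'x^' took drastically different amounts of time to check, even though they should behave identically. (I only switched the character by accident.) The timings are below.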
In [1]: import re
In [2]: with open('/tmp/longfile.txt') as f:
   ...:     longfile = f.read()
   ...:
In [3]: len(re.findall('\n',longfile))
Out[3]: 275000
In [4]: len(longfile)
Out[4]: 24733175
...
In [45]: %timeit re.search('x^',longfile)
6.89 ms ± 31.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [46]: %timeit re.search('a^',longfile)
37.2 ms ± 739 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [47]: %timeit re.search(' ^',longfile)
49.8 ms ± 844 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
An online tester (using only the first 50 lines) shows the same behavior (1441880 steps and ~710 ms vs. only 40858 steps and ~113 ms): https://regex101.com/r/AwaHmK/1
What is Python doing here that makes 'a^' take so much longer than 'x^'?
Just to see whether something was going on inside timeit or IPython, I wrote a simple timing function myself, and everything checks out:
In [57]: import time
In [59]: import numpy as np
In [62]: def timing(regex,N=7,n=100):
    ...:     tN = []
    ...:     for i in range(N):
    ...:         t0 = time.time()
    ...:         for j in range(n):
    ...:             re.search(regex,longfile)
    ...:         t1 = time.time()
    ...:         tN.append((t1-t0)/n)
    ...:     return np.mean(tN)*1000, np.std(tN)*1000
    ...:
In [63]: timing('a^')
Out[63]: (37.414282049451558, 0.33898056279589844)
In [64]: timing('x^')
Out[64]: (7.2061508042471756, 0.22062989840321218)
I have also replicated my results outside of IPython in a standard Python 3.5.2 shell, so the oddity is not limited to IPython or timeit.
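For anyone who wants to try this without my file, here is a minimal standalone sketch of the same measurement using only the standard library. The text in `sample_text` is a made-up stand-in for /tmp/longfile.txt, and `time_search` is a hypothetical helper name, so the absolute numbers (and possibly the ratios) will differ from the ones above.

```python
import re
import timeit

# Assumption: synthetic filler text standing in for /tmp/longfile.txt.
sample_text = "the quick brown fox jumps over the lazy dog\n" * 20000


def time_search(pattern, text, repeat=3, number=5):
    """Return the best per-call time in milliseconds for re.search(pattern, text)."""
    # timeit.repeat returns one total time per repetition; take the fastest.
    totals = timeit.repeat(lambda: re.search(pattern, text),
                           repeat=repeat, number=number)
    return min(totals) / number * 1000


if __name__ == "__main__":
    for pattern in ("x^", "a^", " ^"):
        # Without re.MULTILINE, '^' only matches at the start of the string,
        # so a literal character followed by '^' can never match.
        assert re.search(pattern, sample_text) is None
        print("{!r}: {:.3f} ms".format(pattern, time_search(pattern, sample_text)))
```

Run as a plain script, it prints one timing per pattern, which makes it easy to compare the three regexes on any input you substitute for `sample_text`.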