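Suppose I want to extract the first word (or floating-point number) that follows a given string found in some text file (see How to extract the first word that follows a string? https://stackoverflow.com/questions/3549877/how-to-extract-the-first-word-that-follows-a-string). I know you can do this with perl or sed, and probably in many other ways. I am looking for performance. What is the fastest way to parse this?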
If you're looking for a fixed string, you probably want to search for it using something like Boyer-Moore or Boyer-Moore-Horspool (for the latter, I'd recommend Ray Gardner's implementation). Note that B-M and B-M-H are both sublinear in the typical case: they can skip over parts of the input without examining every character. Regular expressions, by contrast, are linear at best [1], and many implementations (those that use backtracking) can be quadratic or worse on unfavorable input.
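To make the idea concrete, here is a minimal sketch of Boyer-Moore-Horspool followed by extraction of the first whitespace-delimited token after the match. It is not Ray Gardner's implementation; the names horspool_find and first_token_after are made up for illustration, and the input is assumed to be bytes already in memory. In Python specifically, the built-in bytes.find is implemented in C and will almost certainly be faster, so treat this as a picture of the technique rather than a tuned routine.

```python
def horspool_find(haystack: bytes, needle: bytes, start: int = 0) -> int:
    """Return the index of the first match of needle at or after start, or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return start if start <= n else -1

    # Bad-character table: how far the window may slide when the byte under
    # the window's last position is c. Bytes not in the pattern slide by m.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i

    pos = start
    while pos + m <= n:
        # Compare right to left, as Horspool does.
        j = m - 1
        while j >= 0 and haystack[pos + j] == needle[j]:
            j -= 1
        if j < 0:
            return pos
        # Slide by the shift for the byte aligned with the pattern's end;
        # this is what lets the search skip input bytes entirely.
        pos += shift[haystack[pos + m - 1]]
    return -1


def first_token_after(data: bytes, marker: bytes):
    """Return the first whitespace-delimited token after marker, or None."""
    hit = horspool_find(data, marker)
    if hit < 0:
        return None
    parts = data[hit + len(marker):].split(None, 1)
    return parts[0] if parts else None
```

For example, first_token_after(b"pressure: 3.14 bar", b"pressure:") returns the bytes 3.14, which can then be passed to float() if a number is expected.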
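The next step is to make sure the data gets read into memory as quickly as possible. In practice, this will often be the bottleneck. Unfortunately, dealing with that bottleneck well usually requires some non-portable code. Under Linux, mmap tends to be your best bet, while under Windows you're usually best off reading large chunks at a time, calling CreateFile with the FILE_FLAG_NO_BUFFERING flag. It's also worth using an I/O completion port (IOCP) to do the reading, so you can do the searching and the reading in parallel.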
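As a rough illustration of the memory-mapping approach, the sketch below maps a file read-only with Python's mmap module and searches the mapping directly, so the file is never copied into a separate buffer up front. The function name first_token_after_in_file is hypothetical, the token is assumed to be delimited by ASCII whitespace, and the Windows-specific CreateFile / IOCP route is not shown here.

```python
import mmap

WHITESPACE = b" \t\n\r\v\f"


def first_token_after_in_file(path: str, marker: bytes):
    """Return the first whitespace-delimited token after marker in the file, or None."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file. Note: mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            hit = mm.find(marker)  # fixed-string search over the mapping
            if hit < 0:
                return None
            n = len(mm)
            i = hit + len(marker)
            # Skip whitespace between the marker and the token.
            while i < n and mm[i] in WHITESPACE:
                i += 1
            # Scan to the end of the token.
            j = i
            while j < n and mm[j] not in WHITESPACE:
                j += 1
            return mm[i:j] if j > i else None
```

Only the few bytes of the token are copied out of the mapping; the operating system pages in the rest of the file on demand as the search touches it.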
[1] In theory it would be possible to write an RE engine that did sublinear searching for the right kinds of patterns, but if any actually does, I'm not aware of it.