Enhanced regex solution
Assuming you do care about handling abbreviations such as Mr. and Mrs., the following single-regex solution works pretty well:
<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
    # Split sentences on whitespace between them.
    # See: http://stackoverflow.com/a/5844564/433790
    (?<=          # Sentence split location preceded by
      [.!?]       # either an end of sentence punct,
    | [.!?][\'"]  # or end of sentence punct and quote.
    )             # End positive lookbehind.
    (?<!          # But don\'t split after these:
      Mr\.        # Either "Mr."
    | Mrs\.       # Or "Mrs."
    | Ms\.        # Or "Ms."
    | Jr\.        # Or "Jr."
    | Dr\.        # Or "Dr."
    | Prof\.      # Or "Prof."
    | Sr\.        # Or "Sr."
    | T\.V\.A\.   # Or "T.V.A."
                  # Or... (you get the idea).
    )             # End negative lookbehind.
    \s+           # Split on whitespace between sentences,
    (?=\S)        # (but not at end of string).
    %xi';  // End $split_sentences.

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "four". Sentence "five"! Sentence "'.
        'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! '; // Note ws at end.

$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>
Note that you can easily add or remove abbreviations from the expression. Given the following test paragraph:
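This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!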
Here is the output from the script:
Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]
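Because the abbreviations are just alternatives inside the negative lookbehind, the list can also be kept in a PHP array and the pattern assembled from it. The following is a minimal sketch of that idea and is not part of the script above: the helper name `build_split_regex` and the sample abbreviations are made up for illustration, and the `x` flag is dropped so the pattern fits on one line.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// builds the sentence-splitting regex from a plain list of abbreviations.
function build_split_regex(array $abbreviations): string
{
    // preg_quote() escapes the dots (and the % delimiter) so that
    // "Mr." is matched literally.
    $quoted = array_map(
        function ($abbr) { return preg_quote($abbr, '%'); },
        $abbreviations
    );

    // An empty negative lookbehind "(?<!)" can never succeed, so the
    // group is left out entirely when the list is empty.
    $not_after = $quoted ? '(?<!' . implode('|', $quoted) . ')' : '';

    return '%(?<=[.!?]|[.!?][\'"])' // after end punctuation, optionally a quote
         . $not_after               // but not right after a listed abbreviation
         . '\s+(?=\S)%i';           // split on whitespace, not at end of string
}

$regex = build_split_regex(['Mr.', 'Mrs.', 'Dr.', 'e.g.']);
$parts = preg_split(
    $regex,
    'See Dr. Who, e.g. tonight. Then sleep!',
    -1,
    PREG_SPLIT_NO_EMPTY
);
print_r($parts); // two elements: the text up to "tonight." and "Then sleep!"
```

PCRE requires each alternative in a lookbehind to have a fixed length, which literal abbreviations always satisfy, so entries of different lengths can be mixed freely.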
Basic regex solution
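The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is: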
$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
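Note that both solutions correctly identify sentences that end with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark, the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.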
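To see what the quote alternative buys, the two basic expressions can be run against the same string. This is a small sketch with a sample sentence invented for the purpose; the variable names are illustrative only.

```php
<?php
// Sample text (made up for this comparison): the first sentence ends
// with a closing quote after its end punctuation.
$text = 'He said "stop!" Then he left. Done.';

$with_quotes = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/'; // quote-aware version
$simplified  = '/(?<=[.!?])\s+(?=\S)/';            // punctuation only

$a = preg_split($with_quotes, $text, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($simplified, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($a); // three sentences: the quoted one is split off on its own
print_r($b); // two sentences: no split after the closing quote
```

With the simplified expression the character just before the whitespace is the closing quote, not the punctuation mark, so the lookbehind fails there and the first two sentences stay joined.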
Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to the regex and test string. (In answer to PapyRef's comment question.)
Edit: 20130820_1800 Tidied and renamed the regex and added a shebang. Also fixed the regexes to prevent splitting text on trailing whitespace.