Split magic. My original assumption was technically incorrect (although it made a solution easier to find). So let's look at your split pattern:
(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)
I rearranged it a bit. The outer parentheses are not needed, and I moved the single characters into a character class at the end:
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]
That is just to do some sorting up front. Let's call this pattern the split pattern, s for short, and define it.
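As a quick sanity check, the rearranged pattern should split exactly like the original one, because preg_split ignores the capture group unless PREG_SPLIT_DELIM_CAPTURE is set. A minimal sketch, where the sample path is an assumption chosen for illustration:

```php
<?php
// Original pattern from the question and the rearranged version.
$original   = '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/';
$rearranged = '/html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]/';

// Sample path (an assumption, not taken from the question).
$subject = '/2009/06/pagerank-update.html';

$a = preg_split($original, $subject, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($rearranged, $subject, -1, PREG_SPLIT_NO_EMPTY);

var_dump($a === $b); // bool(true)
print_r($b);         // 2009, 06, pagerank, update
```

Note that the short part 06 is still in the result; a plain split does not care about length.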
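You want to match all parts that contain none of the sequences from the split pattern and that are at least three characters long.
I could achieve this with the following pattern, which handles the multi-character split sequences correctly and supports Unicode.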
$pattern = '/
    (?(DEFINE)
        (?<s>   # define subpattern s, which is the split pattern
            html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
            [\\/._=?&%+-]   # single characters, slightly optimized as a character class
        )
    )
    (?:(?&s))              # consume one split sequence (the URL starts with \/)
    \K                     # the reported match starts here
    (?:(?!(?&s)).){3,}     # take characters that do not start a split sequence, 3 minimum
/ux';
Or more compact:
$path = '/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml';
$subject = urldecode($path);
$pattern = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
$word_array = preg_match_all($pattern, $subject, $m) ? $m[0] : [];
print_r($word_array);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
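The role of \K is easier to see in isolation. The following is a reduced sketch, not the pattern above: a single slash stands in for the whole split pattern, and the sample subject is an assumption for illustration.

```php
<?php
// Sample subject (an assumption); "/" stands in for the full split pattern.
$subject = '/2009/06/pagerank';

// Without \K the leading delimiter is part of the reported match.
preg_match_all('~/[^/]{3,}~', $subject, $without);

// With \K everything matched up to that point is dropped from the reported match.
preg_match_all('~/\K[^/]{3,}~', $subject, $with);

print_r($without[0]); // /2009, /pagerank
print_r($with[0]);    // 2009, pagerank
```

In both cases 06 is skipped because it is shorter than three characters.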
The same principle can be used with preg_split as well. It looks a little different:
$pattern = '/
(?(DEFINE) # define subpattern which is the split pattern
(?<s>
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
[\/._=?&%+-]
)
)
(?:(?!(?&s)).){3,}(*SKIP)(*FAIL)       # three or more characters: keep them, no split here
|(?:(?!(?&s)).){1,2}(*SKIP)(*ACCEPT)   # one or two characters: too short, consume as delimiter
|(?&s)                                  # otherwise split at the split pattern
/ux';
Usage:
$word_array = preg_split($pattern, $subject, 0, PREG_SPLIT_NO_EMPTY);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
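To make the control flow of (*SKIP)(*FAIL) visible, here is a reduced sketch with a single slash as the only delimiter. It is a simplification for illustration, not the pattern above, and the sample subject is an assumption.

```php
<?php
// Sample subject (an assumption); "/" is the only delimiter in this sketch.
$subject = '/2009/06/pagerank';

$pattern = '~
    [^/]{3,}(*SKIP)(*FAIL)  # long enough: fail here and resume after the segment
    | [^/]{1,2}             # one or two characters: treated as part of the delimiter
    | /                     # the delimiter itself
~x';

$parts = preg_split($pattern, $subject, -1, PREG_SPLIT_NO_EMPTY);
print_r($parts); // 2009, pagerank
```

Segments of three or more characters are never matched as delimiters, so they survive the split; shorter segments are matched and therefore disappear together with the delimiters around them.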
These routines work as requested, but that comes at a price in performance. The cost is similar to that of the old answer.
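If the cost matters for your input, it is best measured rather than guessed. A rough timing sketch, assuming the sample path from above; the function names single_pass and two_step and the iteration count are made up for this illustration, and the numbers will vary with PHP version and PCRE JIT settings:

```php
<?php
$subject = urldecode('/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml');
$s = 'html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]';

// Single-pass variant: match the words directly.
function single_pass(string $s, string $subject): array
{
    $pattern = '/(?(DEFINE)(?<s>' . $s . '))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
    return preg_match_all($pattern, $subject, $m) ? $m[0] : [];
}

// Two-step variant: split first, then keep parts of three or more characters.
function two_step(string $s, string $subject): array
{
    $parts = preg_split('/' . $s . '/u', $subject, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(preg_filter('/^.{3,}$/u', '$0', $parts));
}

$iterations = 10000; // arbitrary choice for this sketch
foreach (['single_pass', 'two_step'] as $fn) {
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($s, $subject);
    }
    printf("%-12s %.4f s\n", $fn, microtime(true) - $start);
}
```

Both functions return the same five words for this subject, so the comparison is like for like.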
Related questions:
- Antimatch with regex https://stackoverflow.com/q/4660818/367456
- Split string by delimiter, but not if it is escaped https://stackoverflow.com/q/6243778/367456
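Old answer, doing a two-step processing (first splitting, then filtering)
Because you are using a split routine, it will split regardless of length.
So what you can do is filter the result. You can do that with a regular expression again (preg_filter http://php.net/preg_filter), for example one that drops everything shorter than three characters: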
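$word_array = preg_filter(
    '/^.{3,}$/', '$0',   // keep only parts of three or more characters
    preg_split(
        '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
        urldecode($path),
        -1,              // no limit; passing NULL here is deprecated as of PHP 8.1
        PREG_SPLIT_NO_EMPTY
    )
);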
Result:
Array
(
[0] => 2009
[2] => pagerank
[3] => update
)
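Note that the keys in this result are not contiguous (0, 2, 3), because the filter keeps the original keys. If you need a reindexed list, one option is preg_grep combined with array_values. A minimal sketch, assuming a path like the one the result above suggests:

```php
<?php
// Sample path (an assumption inferred from the result shown above).
$path = '/2009/06/pagerank-update.html';

$parts = preg_split(
    '/html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]/',
    urldecode($path),
    -1,
    PREG_SPLIT_NO_EMPTY
);

// preg_grep keeps the matching entries with their keys; array_values reindexes from 0.
// The u modifier makes {3,} count characters instead of bytes.
$word_array = array_values(preg_grep('/^.{3,}$/u', $parts));

print_r($word_array); // 2009, pagerank, update
```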