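Note that your pn = re.compile(r'@(\S+)') regex captures any one or more non-whitespace characters after @, so a trailing colon ends up inside the captured name.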
To exclude the colon (:) from the match, convert the shorthand class \S to its negated character class equivalent, [^\s], and add : to it:
pn = re.compile(r'@([^\s:]+)')
Now the capture stops before the first colon (:) or whitespace character. See the regex demo: https://regex101.com/r/gD8xH9/1
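A minimal sketch of the difference between the two patterns. The sample string and the names sample, greedy, and stop_at_colon are made up for illustration and are not from the original question:

```python
import re

# Hypothetical sample input, for illustration only.
sample = "RT @BestOfGalaxies: Let's sit under the stars @jonghyun__bot"

greedy = re.compile(r'@(\S+)')             # any run of non-whitespace after @
stop_at_colon = re.compile(r'@([^\s:]+)')  # same, but a colon also ends the run

print(greedy.findall(sample))         # ['BestOfGalaxies:', 'jonghyun__bot']
print(stop_at_colon.findall(sample))  # ['BestOfGalaxies', 'jonghyun__bot']
```

The only change is the character class: everything \S accepts is still accepted, except the colon.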
If you need to capture up to the last colon (:) instead, just add : after the capturing group:
pn = re.compile(r'@(\S+):')
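As a rough illustration of how this variant behaves, here is a sketch on a made-up string (text and the pattern names are hypothetical). Because \S+ is greedy, it first consumes the whole token and then backtracks to the last colon in it. Since the colon is now required, a mention without one is not matched at all:

```python
import re

# Hypothetical input: one mention containing two colons, one with none.
text = "@user:name: hello @plain there"

up_to_last_colon = re.compile(r'@(\S+):')       # colon required after the group
up_to_first_colon = re.compile(r'@([^\s:]+)')   # colon excluded from the group

print(up_to_last_colon.findall(text))   # ['user:name']
print(up_to_first_colon.findall(text))  # ['user', 'plain']
```

Which variant fits depends on whether every mention in your data is followed by a colon.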
As for a URL-matching regex, there are many on the web (for example, http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/); just choose the one that works best for you (see https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string).
Here is some sample code (https://ideone.com/rgAy2K):
import re
p = re.compile(r'@([^\s:]+)')
test_str = "@galaxy5univ I like you\nRT @BestOfGalaxies: Let's sit under the stars ...\n@jonghyun__bot .........((thanks)\nRT @yosizo: thanks.ddddd <https://y...content-available-to-author-only...o.com>\nRT @LDH_3_yui: #fam, ccccc https://m...content-available-to-author-only...s.com"
print(p.findall(test_str))
p2 = re.compile(r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?')
print(p2.findall(test_str))
# => ['galaxy5univ', 'BestOfGalaxies', 'jonghyun__bot', 'yosizo', 'LDH_3_yui']
# => ['https://yahoo.com', 'https://msn.news.com']