I have a df that holds thousands of links for different users in a column labeled url, like this:
https://www.google.com/something
https://mail.google.com/anohtersomething
https://calendar.google.com/somethingelse
https://www.amazon.com/yetanotherthing
I have the following code:
import urlparse
df['protocol'] = ''
df['domain'] = ''
df['path'] = ''
df['query'] = ''
df['fragment'] = ''
unique_urls = df.url.unique()
l = len(unique_urls)
i = 0
for url in unique_urls:
    i += 1
    print "\r%d / %d" % (i, l),
    split = urlparse.urlsplit(url)
    row_index = df.url == url
    df.loc[row_index, 'protocol'] = split.scheme
    df.loc[row_index, 'domain'] = split.netloc
    df.loc[row_index, 'path'] = split.path
    df.loc[row_index, 'query'] = split.query
    df.loc[row_index, 'fragment'] = split.fragment
The code parses and splits the urls correctly, but it is slow because I iterate over every row of the df. Is there a more efficient way to parse the URLs?
You can use Series.map (http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.map.html#pandas.Series.map) to accomplish the same thing in one line:
df['protocol'],df['domain'],df['path'],df['query'],df['fragment'] = zip(*df['url'].map(urlparse.urlsplit))
Using timeit, this runs in 2.31 ms per loop instead of 179 ms per loop as in the original method, when run on 186 urls. (Note, however, that this code is not optimized for duplicates and will run the same url through urlparse multiple times.)
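If duplicates matter, one way to avoid re-parsing the same url is to split each unique url exactly once and map the cached results back onto the column. The sketch below assumes Python 3 (where the module is urllib.parse rather than urlparse) and uses a small made-up frame for illustration:

```python
from urllib.parse import urlsplit
import pandas as pd

# Toy frame with a duplicate url, standing in for the real df.
df = pd.DataFrame({'url': ['https://www.google.com/a',
                           'https://www.google.com/a',   # duplicate
                           'https://mail.google.com/b']})

# Split each unique url exactly once, then map the cached results back.
cache = {url: urlsplit(url) for url in df['url'].unique()}
parts = df['url'].map(cache)

df['protocol'] = parts.map(lambda s: s.scheme)
df['domain'] = parts.map(lambda s: s.netloc)
df['path'] = parts.map(lambda s: s.path)
df['query'] = parts.map(lambda s: s.query)
df['fragment'] = parts.map(lambda s: s.fragment)
```

Series.map accepts a dict directly, so the duplicate rows reuse the already-computed SplitResult instead of calling urlsplit again.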
Full code:
import pandas
import urlparse

urls = ['https://www.google.com/something','https://mail.google.com/anohtersomething','https://www.amazon.com/yetanotherthing'] # tested with a list of 186 urls instead
df = pandas.DataFrame(urls, columns=['url']) # build the example frame
df['protocol'],df['domain'],df['path'],df['query'],df['fragment'] = zip(*df['url'].map(urlparse.urlsplit))
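The snippet above targets Python 2 (print statement, urlparse module). A minimal sketch of the same one-liner on Python 3, where the module was renamed to urllib.parse, would be:

```python
from urllib.parse import urlsplit
import pandas as pd

urls = ['https://www.google.com/something',
        'https://mail.google.com/anohtersomething',
        'https://www.amazon.com/yetanotherthing']
df = pd.DataFrame({'url': urls})

# map() applies urlsplit to every url; zip(*...) transposes the
# resulting SplitResult tuples into five per-field sequences.
df['protocol'], df['domain'], df['path'], df['query'], df['fragment'] = \
    zip(*df['url'].map(urlsplit))
```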