对于我的学士论文,我需要从大约 40000 个网站中获取一些数据。因此,我使用 python 请求,但目前从服务器获取响应非常慢。
有没有办法加快速度并保持当前的标题设置?我发现的所有教程都没有标题。
这是我的代码片段:
def parse(url):
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(url, headers=headers)
for line in r.iter_lines():
...
那么你可以使用线程,因为这是一个I/O 限制问题。使用内置的threading
图书馆是您最好的选择。我用的是Semaphore https://docs.python.org/2/library/threading.html#semaphore-objects对象来限制可以同时运行的线程数。
import time
import threading
# Number of parallel threads
lock = threading.Semaphore(2)
def parse(url):
"""
Change to your logic, I just use sleep to mock http request.
"""
print 'getting info', url
sleep(2)
# After we done, subtract 1 from the lock
lock.release()
def parse_pool():
# List of all your urls
list_of_urls = ['website1', 'website2', 'website3', 'website4']
# List of threads objects I so we can handle them later
thread_pool = []
for url in list_of_urls:
# Create new thread that calls to your function with a url
thread = threading.Thread(target=parse, args=(url,))
thread_pool.append(thread)
thread.start()
# Add one to our lock, so we will wait if needed.
lock.acquire()
for thread in thread_pool:
thread.join()
print 'done'
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)