我遵循了几个在线指南,试图构建一个可以识别并从网站下载所有 pdf 的脚本,从而避免我手动执行此操作。到目前为止,这是我的代码:
from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib
# connect to website and get list of all pdfs
url="http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"
response = request.urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(.pdf)'))
# clean the pdf link names
url_list = []
for el in links:
url_list.append(("http://www.gatsby.ucl.ac.uk/teaching/courses/" + el['href']))
#print(url_list)
# download the pdfs to a specified location
for url in url_list:
print(url)
fullfilename = os.path.join('E:\webscraping', url.replace("http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/", "").replace(".pdf",""))
print(fullfilename)
request.urlretrieve(url, fullfilename)
该代码似乎可以找到所有 pdf(取消注释print(url_list)
看到这个)。然而,它在下载阶段失败。特别是我收到此错误,并且我无法理解出了什么问题:
E:\webscraping>python get_pdfs.py
http://www.gatsby.ucl.ac.uk/teaching/courses/http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016/cribsheet.pdf
E:\webscraping\http://www.gatsby.ucl.ac.uk/teaching/courses/cribsheet
Traceback (most recent call last):
File "get_pdfs.py", line 26, in <module>
request.urlretrieve(url, fullfilename)
File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Users\User\Anaconda3\envs\snake\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
有人可以帮我吗?