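First of all, you should understand that the HTTP protocol is based on requests (https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods). Whatever the JavaScript on the page does, its end result is an HTTP request that the server answers with the file contents. You need to reverse-engineer the page, find out how the correct request is built, and reproduce it as closely as possible.
So, let's try to do this step by step: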
- Right-click the element that triggers the download and choose "Inspect element".
- In the source code you can see the name of the JavaScript function this element calls.
- Type the name of the function into the console without parentheses and click the button that appears next to the returned value in the console (this button opens the JavaScript function in the source code).
- In the source code we see that the function calls submit on the HTML element with id frmDownload. So go back to the "Inspector" tab and type this id into the search box.
- Now we find that this element is an HTML form (https://www.w3schools.com/html/html_forms.asp). The form sends a POST (https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST) request to the URL https://www.vcsprojectdatabase.org/services/publicViewServices/fetchProjectsExport with the following data:
searchTerm=
country=
sectoral_scope=0
recentProjects=
sort=projectId
dir=DESC
formatType=csv
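To make it concrete what the browser puts on the wire for this form, here is a minimal sketch that encodes the fields listed above as an application/x-www-form-urlencoded body, using only the standard library. The variable names are just for illustration.

```python
from urllib.parse import urlencode, parse_qsl

# Form fields exactly as listed above; fields with empty values are still sent.
form_fields = {
    "searchTerm": "",
    "country": "",
    "sectoral_scope": "0",
    "recentProjects": "",
    "sort": "projectId",
    "dir": "DESC",
    "formatType": "csv",
}

# urlencode() builds the body of an application/x-www-form-urlencoded POST.
body = urlencode(form_fields)
print(body)
# searchTerm=&country=&sectoral_scope=0&recentProjects=&sort=projectId&dir=DESC&formatType=csv

# Round trip: decoding the body gives back the same fields.
decoded = dict(parse_qsl(body, keep_blank_values=True))
```

This is the same kind of string you should see as the request payload in the browser's developer tools.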
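This information is enough to try to reproduce the request in Python.
Let's write a small script that builds and sends the same request and saves the result to a .csv file: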
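import requests

# Form fields copied from the frmDownload form
data = {
    "searchTerm": "",
    "country": "",
    "sectoral_scope": "0",
    "recentProjects": "",
    "sort": "projectId",
    "dir": "DESC",
    "formatType": "csv"
}

# Send the same POST request the form would send
response = requests.post("https://www.vcsprojectdatabase.org/services/publicViewServices/fetchProjectsExport", data=data)
response.raise_for_status()  # fail on an HTTP error instead of saving an error page

# Save the raw response body as the csv file
with open("res.csv", "wb") as f:
    f.write(response.content)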
Run it and... it works: res.csv contains the correct result.
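But that's not all. Usually things aren't this easy. To make our request look the same as the one the browser sends, we should look at the request headers (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers). To capture the HTTP requests made by the browser, open the "Network" tab.
Now press the download button on the web page and download the csv file. Our POST request now shows up in the request table. Click it and look at the "Request headers" section of the "Headers" tab.
There's a Cookie (https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies) header, which mostly doesn't matter for the request and can be ignored. But if you run into problems with the request, look through the earlier requests, find the one whose server response carries a Set-Cookie header, and repeat it.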
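As a rough sketch of that cookie step, the snippet below turns a Set-Cookie response header into the Cookie header for the next request, using only the standard library. The cookie name and value are made up for illustration; the real site may set different cookies or none at all.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value, as it might appear in an earlier server response.
set_cookie_header = "JSESSIONID=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie_header)

# Only name=value pairs are sent back; attributes such as Path or HttpOnly are not.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

headers = {"Cookie": cookie_header}
print(headers)
```

If you use the requests module, a requests.Session() object does this bookkeeping for you: cookies set by one response are sent with later requests made through the same session.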
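Let's improve our script and copy the important headers from the browser (we don't include Host, Content-Length and Connection, because the Python requests module adds them automatically; DNT and Upgrade-Insecure-Requests aren't needed at all).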
import requests

# Form fields copied from the frmDownload form
data = {
    "searchTerm": "",
    "country": "",
    "sectoral_scope": "0",
    "recentProjects": "",
    "sort": "projectId",
    "dir": "DESC",
    "formatType": "csv"
}

# Request headers copied from the browser's "Network" tab
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    # "br" is left out: requests only decodes Brotli when a brotli package is installed
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://www.vcsprojectdatabase.org/",
    "Content-Type": "application/x-www-form-urlencoded"
}

response = requests.post("https://www.vcsprojectdatabase.org/services/publicViewServices/fetchProjectsExport",
                         data=data, headers=headers)
response.raise_for_status()  # fail on an HTTP error instead of saving an error page

# Save the raw response body as the csv file
with open("res.csv", "wb") as f:
    f.write(response.content)
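When a request like this goes wrong, the server often answers with an HTML error page rather than the file, so it can help to sanity-check the downloaded bytes before trusting res.csv. Below is a minimal sketch of such a check; looks_like_csv is a hypothetical helper, the sample payloads are made up rather than real output from the site, and the test is only a heuristic (a valid CSV with ragged rows would fail it).

```python
import csv
import io


def looks_like_csv(content: bytes, min_columns: int = 2) -> bool:
    """Heuristic: a header with several columns and rows of matching width."""
    text = content.decode("utf-8-sig", errors="replace")
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows or len(rows[0]) < min_columns:
        return False
    width = len(rows[0])
    return all(len(row) == width for row in rows)


# Made-up sample payloads for illustration only.
good = b"Project ID,Name,Country\r\n1,Example project,Peru\r\n"
bad = b"<!DOCTYPE html>\n<html><body>Access denied</body></html>\n"

print(looks_like_csv(good))  # True
print(looks_like_csv(bad))   # False
```

You could call it on the response body right before writing the file and stop with an error message if it returns False.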
P.S. Don't forget to ask the site owner for permission.