Example: Japanese Wikipedia search (keyword: 漫画)
https://ja.wikipedia.org/w/index.php?title=%E7%89%B9%E5%88%A5:%E6%A4%9C%E7%B4%A2&limit=20&offset=0&ns0=1&search=%E6%BC%AB%E7%94%BB&advancedSearch-current={}
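This is the URL of the first page of the paginated search results.
Watching how the URL changes from page to page shows that offset increases by 20 per page.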
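The offset arithmetic can be sketched as follows. This is a minimal illustration, assuming 20 results per page as in the URL above; `PER_PAGE` and `offset_for_page` are hypothetical names introduced here, not part of the original script.

```python
PER_PAGE = 20  # matches limit=20 in the search URL


def offset_for_page(page, per_page=PER_PAGE):
    """Return the offset value for a 1-based page number (hypothetical helper)."""
    return (page - 1) * per_page


# The same offsets produced by a stepped range: 0, 20, 40, ..., 980 (50 pages).
offsets = list(range(0, 1000, PER_PAGE))

for page in (1, 2, 3):
    print(page, offset_for_page(page))
```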
The usual way to write this is:
for i in range(0, 1000, 20):  # 50 pages as an example; i grows by 20 per page
    url = "https://ja.wikipedia.org/w/index.php?title=%E7%89%B9%E5%88%A5:%E6%A4%9C%E7%B4%A2&limit=20&offset={}&ns0=1&search=%E6%BC%AB%E7%94%BB&advancedSearch-current={}".format(i)
Running the program at this point raises an error:
tuple index out of range
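The index is out of range for the argument tuple. Looking at the format call, the URL contains two {} placeholders, but format is given only one argument, i.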
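The failure can be reproduced in isolation with a shortened template. This is a minimal sketch; `template` and `error_message` are names made up for the illustration, and the exact wording of the message depends on the Python version.

```python
# A shortened stand-in for the search URL: two placeholders.
template = "offset={}&advancedSearch-current={}"

error_message = None
try:
    template.format(0)  # only one argument for two placeholders
except IndexError as exc:
    error_message = str(exc)
    print(type(exc).__name__ + ":", error_message)
```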
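In the original URL the second {} is a literal empty value, not a parameter to fill in, so we pass None as the second argument to format. Note that this writes the text None into that position, so the generated URL ends with advancedSearch-current=None instead of advancedSearch-current={}.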
for i in range(0, 1000, 20):  # start at 0 so the first page is included
    url = 'https://ja.wikipedia.org/w/index.php?title=%E7%89%B9%E5%88%A5:%E6%A4%9C%E7%B4%A2&limit=20&offset={}&ns0=1&search=%E6%BC%AB%E7%94%BB&advancedSearch-current={}'.format(i,None)
Check the output.
The URL is now generated without the error.
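An alternative, if the literal {} should survive in the generated URL, is to escape the braces by doubling them, which str.format renders as a single pair. The sketch below is one possible variant, not the approach used in this post; `BASE` and `urls` are names chosen for the illustration, and only three pages are generated to keep it short.

```python
# {{}} is an escaped pair of braces: format() emits it as a literal {}.
BASE = (
    "https://ja.wikipedia.org/w/index.php"
    "?title=%E7%89%B9%E5%88%A5:%E6%A4%9C%E7%B4%A2"
    "&limit=20&offset={}&ns0=1&search=%E6%BC%AB%E7%94%BB"
    "&advancedSearch-current={{}}"
)

# Only one real placeholder is left, so a single argument is enough.
urls = [BASE.format(offset) for offset in range(0, 60, 20)]

for url in urls:
    print(url)

# For comparison: passing None fills a placeholder with the text "None".
with_none = "offset={}&advancedSearch-current={}".format(0, None)
```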
The full source code:
import requests
from lxml import etree


def get_url(header):
    # verify=False is used below, so silence the resulting urllib3 warnings.
    requests.packages.urllib3.disable_warnings()
    for i in range(0, 1000, 20):  # 50 pages, offset grows by 20 per page
        try:
            url = 'https://ja.wikipedia.org/w/index.php?title=%E7%89%B9%E5%88%A5:%E6%A4%9C%E7%B4%A2&limit=20&offset={}&ns0=1&search=%E6%BC%AB%E7%94%BB&advancedSearch-current={}'.format(i,None)
            selector = etree.HTML(requests.get(url, headers=header, verify=False).text)
            urls = selector.xpath('//ul[@class="mw-search-results"]/li[@class="mw-search-result"]//a/@href')
            for one in urls:
                # Result links are site-relative; prepend the site root.
                if 'http' not in one:
                    one = 'https://ja.wikipedia.org' + one
                print(one)
                # Append each result URL to the output file.
                with open("维基百科_url.txt", "a", encoding="utf-8") as w:
                    w.write(one + "\n")
        except Exception as e:
            # Report the failure and move on to the next page.
            print(e)
            continue
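The format problem can also be sidestepped by building the query string from a dictionary with the standard library. The sketch below is one possible variant, not the code used above. `SITE`, `build_search_url` and `absolute_link` are hypothetical helpers, and the parameter values are decoded from the example URL. `urlencode` percent-encodes the colon and the braces (`%3A`, `%7B%7D`), which decode to the same values.

```python
from urllib.parse import urlencode, urljoin

SITE = "https://ja.wikipedia.org"


def build_search_url(offset, limit=20, keyword="漫画"):
    """Build the search URL for one results page (hypothetical helper)."""
    params = {
        "title": "特別:検索",
        "limit": limit,
        "offset": offset,
        "ns0": 1,
        "search": keyword,
        "advancedSearch-current": "{}",  # literal braces, no escaping needed
    }
    return SITE + "/w/index.php?" + urlencode(params)


def absolute_link(href):
    """Resolve a site-relative href against the site root (hypothetical helper)."""
    return urljoin(SITE + "/", href)


print(build_search_url(40))
print(absolute_link("/wiki/Example"))
```

Using `urljoin` instead of the `'http' not in one` check leaves links that are already absolute untouched and resolves relative ones against the site root.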