I want to scrape the link and the title of every question on the page. One element has the following structure:
<a data-click-id="body" class="SQnoC3ObvgnGjWt90zD9Z" href="/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/">
<h2 class="s1okktje-0 cDxKta">
<span style="font-weight:normal">Calculating Expiration Dates - Previous Solution No Longer Works</span>
</h2>
</a>
I use questions = driver.find_elements_by_xpath('//a[@data-click-id="body"]') to get the questions and then iterate over them in a for loop. I can get the link with question.get_attribute('href'). However, I don't know how to extract the title span from a question. Does anyone know how to do this?
With Selenium:
question.find_element_by_xpath('./h2/span').text
will return the text of the underlying span element for each question inside the for loop. (Note the relative XPath starting with ./, and that it is find_element, singular, so .text can be read directly.)
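The same relative-XPath idea can be checked offline, without a browser, against the sample element from the question. This is a minimal sketch using only the standard library's xml.etree.ElementTree (the live answers use Selenium/lxml instead); the sample markup is copied from the question:

```python
# Sketch: extract href and title from the question's sample element
# using only the standard library, mirroring the ./h2/span XPath.
import xml.etree.ElementTree as ET

sample = (
    '<a data-click-id="body" class="SQnoC3ObvgnGjWt90zD9Z" '
    'href="/r/excel/comments/ayiahc/calculating_expiration_dates_previous_solution_no/">'
    '<h2 class="s1okktje-0 cDxKta">'
    '<span style="font-weight:normal">'
    'Calculating Expiration Dates - Previous Solution No Longer Works'
    '</span></h2></a>'
)

link = ET.fromstring(sample)
url = link.attrib['href']            # analogous to get_attribute('href')
title = link.find('./h2/span').text  # same relative path as ./h2/span
print("Title: {} --- URL: {}".format(title, url))
```

ElementTree only supports a limited XPath subset, so this is just a way to sanity-check the relative path; on the real page you still need Selenium or lxml.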
With lxml:
import requests
from lxml import html

UA = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}

page = requests.get('https://www.reddit.com/search?q=Expiration&type=link&sort=new',
                    headers=UA)
tree = html.fromstring(page.content)

questions = tree.xpath('//a[@data-click-id="body"]')

parsed_q = []
for question in questions:
    url = question.xpath('./@href')[0]
    title = question.xpath('./h2/span/text()')[0]
    print("Title: {} --- URL: {}".format(title, url))
    parsed_q.append((title, url))

print(parsed_q)