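I apologize if this is a duplicate, but I could not find another question on SO or elsewhere that covers what I need. Here is my problem:
I am using Scrapy to get some information from this web page: http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html. For clarity, here is the block of that page's source that I am interested in: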
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span></p>
<span class='normaltext'>
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [<span class='Helpcourse'
onMouseover="showtip(this,event,'24 Lectures')"
onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
onMouseover="showtip(this,event,'12 Tutorials')"
onMouseout="hidetip()">12T</span>]<br>
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br>
</span><br/><br/<br/>
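Almost all of the markup on that page looks like the block above.
From all of this, I need to grab:
- ANT101H5 Introduction to Biological Anthropology and Archaeology
- Exclusion: ANT100Y5
- Prerequisite: ANT102H5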
The problem is that "Exclusion:" sits inside a <span class="title2">, while ANT100Y5 sits inside the <a> that follows it.
I can't seem to get both of them out of the source. Currently, my code that tries (and fails) to grab ANT100Y5 looks like this:
hxs = HtmlXPathSelector(response)
sites = hxs.select("//*[(name() = 'p' and @class = 'titlestyle') or (name() = 'a' and @href and preceding-sibling::'//span/@class=title2')]")
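To make the target concrete, here is a minimal sketch of the pairing I am after: each title2 label matched with the link text that follows it. It deliberately does not use Scrapy or XPath; it uses the standard library's html.parser as a stand-in, so it says nothing about how Scrapy's selectors behave. SAMPLE is a shortened copy of the fragment quoted above, LabelLinkParser is a hypothetical helper, and treating each <br> as the end of a label's scope is an assumption based on that fragment.

```python
from html.parser import HTMLParser

# Shortened copy of the fragment quoted in the question.
SAMPLE = """
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span></p>
<span class='normaltext'>
Anthropology is the global and holistic study of human biology and behaviour.<br>
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br>
</span>
"""


class LabelLinkParser(HTMLParser):
    """Hypothetical helper: pairs each title2 label with the <a> text after it."""

    def __init__(self):
        super().__init__()
        self.pairs = []          # collected (label, link text) tuples
        self._in_label = False   # currently inside <span class="title2">
        self._in_link = False    # currently inside an <a> that follows a label
        self._label = None       # most recent label text, or None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "title2":
            self._in_label = True
            self._label = ""
        elif tag == "a" and self._label is not None:
            self._in_link = True
        elif tag == "br":
            # Assumption: a <br> ends the line that a label applies to.
            self._label = None

    def handle_endtag(self, tag):
        if tag == "span" and self._in_label:
            self._in_label = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_label:
            self._label += data
        elif self._in_link:
            label = self._label.strip().rstrip(":")
            self.pairs.append((label, data.strip()))


parser = LabelLinkParser()
parser.feed(SAMPLE)
print(parser.pairs)
```

In XPath terms, the same relationship would be an <a> whose preceding sibling is a span with class title2, which is what my expression above is trying to say.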
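I would appreciate any help with this, even if it is a "you are blind for not seeing this other question that answers yours perfectly" (in which case I will vote to close this myself). I am truly at my wit's end.
Thanks in advance.
EDIT: Complete original code after the changes suggested by @Dimitre
I am using the following code: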
class regcalSpider(BaseSpider):
    name = "disc"
    allowed_domains = ['www.utm.utoronto.ca']
    start_urls = ['http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html']

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("/*/p/text()[1] | \
            (//span[@class='title2'])[1]/text() | \
            (//span[@class='title2'])[1]/following-sibling::a[1]/text() | \
            (//span[@class='title2'])[2]/text() | \
            (//span[@class='title2'])[2]/following-sibling::a[1]/text()")
        for site in sites:
            item = RegcalItem()
            item['title'] = site.select("a/text()").extract()
            item['link'] = site.select("a/@href").extract()
            item['desc'] = site.select("text()").extract()
            items.append(item)
        return items

        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
This gives me the following result:
[{"title": [], "link": [], "desc": []},
{"title": [], "link": [], "desc": []},
{"title": [], "link": [], "desc": []}]
This is not the output I need. What am I doing wrong? Keep in mind that I am running this script against http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html, as mentioned above.
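One possible explanation, offered as a guess rather than a verified diagnosis: every branch of the union expression ends in text(), so each site in the loop is a text node, and a text node has no children for a/text(), a/@href, or text() to match. A pattern that avoids this is to select element nodes first and then look things up relative to each element. The sketch below illustrates that pattern with the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector. SAMPLE_XML is a tidied, well-formed stand-in for the page (the real page is HTML and would need cleanup before it parses as XML), the body wrapper and "#" hrefs are placeholders, and extract_courses is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumption: a tidied, well-formed version of the fragment from the question.
SAMPLE_XML = """
<body>
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class="distribution">(SCI)</span></p>
<span class="normaltext">
Course description goes here.<br/>
<span class="title2">Exclusion: </span><a href="#">ANT100Y5</a><br/>
<span class="title2">Prerequisite: </span><a href="#">ANT102H5</a><br/>
</span>
</body>
"""


def extract_courses(root):
    """Hypothetical helper: build one dict per course from element nodes."""
    courses = []
    children = list(root)
    for i, node in enumerate(children):
        if node.tag != "p" or node.get("class") != "titlestyle":
            continue
        # The title is the text of <p> before its first child element.
        course = {"title": (node.text or "").strip()}
        # Assumption: the details block is the element right after the <p>.
        details = children[i + 1] if i + 1 < len(children) else []
        label = None
        for child in details:
            if child.tag == "span" and child.get("class") == "title2":
                label = (child.text or "").strip().rstrip(":")
            elif child.tag == "a" and label is not None:
                course.setdefault(label, []).append((child.text or "").strip())
            elif child.tag == "br":
                label = None  # a <br/> ends the current label's line
        courses.append(course)
    return courses


root = ET.fromstring(SAMPLE_XML.strip())
for course in extract_courses(root):
    print(course)
```

The point of the sketch is the order of operations: the loop variable is always an element, so the lookups that run relative to it have children to inspect.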