我用 python 编写了一个脚本,使用 scrapy 来收集网站上不同帖子的名称及其链接。当我从命令行执行脚本时,它可以完美地工作。现在,我的意图是使用运行脚本CrawlerProcess()
。我在不同的地方寻找类似的问题,但我找不到任何直接的解决方案或更接近的解决方案。但是,当我尝试按原样运行它时,出现以下错误:
从 stackoverflow.items 导入 StackoverflowItem
ModuleNotFoundError:没有名为“stackoverflow”的模块
这是到目前为止我的脚本(stackoverflowspider.py
):
from scrapy.crawler import CrawlerProcess
from stackoverflow.items import StackoverflowItem
from scrapy import Selector
import scrapy
class stackoverflowspider(scrapy.Spider):
name = 'stackoverflow'
start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']
def parse(self,response):
sel = Selector(response)
items = []
for link in sel.xpath("//*[@class='question-hyperlink']"):
item = StackoverflowItem()
item['name'] = link.xpath('.//text()').extract_first()
item['url'] = link.xpath('.//@href').extract_first()
items.append(item)
return items
if __name__ == "__main__":
c = CrawlerProcess({
'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(stackoverflowspider)
c.start()
items.py
包括:
import scrapy
class StackoverflowItem(scrapy.Item):
name = scrapy.Field()
url = scrapy.Field()
这是树:点击查看层次结构 https://i.stack.imgur.com/abQQO.jpg
我知道我可以通过这种方式取得成功,但我只想用上面尝试的方式完成任务:
def parse(self,response):
for link in sel.xpath("//*[@class='question-hyperlink']"):
name = link.xpath('.//text()').extract_first()
url = link.xpath('.//@href').extract_first()
yield {"Name":name,"Link":url}
尽管@Dan-Dev 向我展示了一条通往正确方向的方法,但我决定提供一个对我来说完美无缺的完整解决方案。
除了我在下面粘贴的内容之外,无需更改任何其他内容:
import sys
#The following line (which leads to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\stackoverflow')
from scrapy.crawler import CrawlerProcess
from stackoverflow.items import StackoverflowItem
from scrapy import Selector
import scrapy
class stackoverflowspider(scrapy.Spider):
name = 'stackoverflow'
start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']
def parse(self,response):
sel = Selector(response)
items = []
for link in sel.xpath("//*[@class='question-hyperlink']"):
item = StackoverflowItem()
item['name'] = link.xpath('.//text()').extract_first()
item['url'] = link.xpath('.//@href').extract_first()
items.append(item)
return items
if __name__ == "__main__":
c = CrawlerProcess({
'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(stackoverflowspider)
c.start()
再次,在脚本中包含以下内容解决了问题
import sys
#The following line (which leads to the folder containing "scrapy.cfg") fixed the problem
sys.path.append(r'C:\Users\WCS\Desktop\stackoverflow')
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)