Okay, I'm new to programming in general, and to using Scrapy for this purpose specifically. I wrote a crawler to get data from pins on pinterest.com. The problem is that I used to get data from all of the pins on the page I'm scraping, but now I only get data for the first pin.
I think the problem is in the pipeline or in the spider itself. Things changed after I added "strip" to the spider to get rid of whitespace, but when I changed it back I got the same output, just with the whitespace again. Here is the spider:
from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem

class PinterestSpider(Spider):
    name = "pinterest"
    allowed_domains = ["pinterest.com"]
    start_urls = ["http://www.pinterest.com/llbean/pins/"]

    def parse(self, response):
        hxs = Selector(response)
        item = PinterestItem()
        items = []
        item["pin_link"] = hxs.xpath("//div[@class='pinHolder']/a/@href").extract()[0].strip()
        item["repin_count"] = hxs.xpath("//em[@class='socialMetaCount repinCountSmall']/text()").extract()[0].strip()
        item["like_count"] = hxs.xpath("//em[@class='socialMetaCount likeCountSmall']/text()").extract()[0].strip()
        item["board_name"] = hxs.xpath("//div[@class='creditTitle']/text()").extract()[0].strip()
        items.append(item)
        return items
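As an aside on the pattern above: `extract()` returns a list of every match on the page, and indexing with `[0]` keeps only the first one. A quick sketch of the same indexing pattern, using the standard library's `xml.etree` in place of Scrapy's Selector and placeholder markup standing in for the real page:

```python
import xml.etree.ElementTree as ET

# Placeholder markup standing in for the scraped page
html = "<page><a href='/pin/1'/><a href='/pin/2'/></page>"

# findall returns every match, like Selector.xpath(...).extract()
matches = [a.get("href") for a in ET.fromstring(html).findall(".//a")]
print(matches)     # ['/pin/1', '/pin/2']
print(matches[0])  # '/pin/1' -- indexing with [0] keeps only the first
```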
And here is my pipeline:
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.contrib.exporter import JsonLinesItemExporter

class JsonLinesExportPipeline(object):
    def __init__(self):
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.files = {}

    def spider_opened(self, spider):
        file = open('%s_items.json' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = JsonLinesItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
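For the record, the pipeline only runs if it is enabled in the project settings. Assuming the project is called Pinterest and this class lives in pipelines.py (a guess based on the imports above), the entry would look something like this:

```python
# settings.py -- module path is assumed; adjust to the actual project layout
ITEM_PIPELINES = {
    'Pinterest.pipelines.JsonLinesExportPipeline': 300,
}
```

Older Scrapy versions accepted a plain list here instead of a dict with ordering numbers.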
When I run the command "scrapy crawl pinterest", this is the output I get in the JSON file:
"pin_link": "/pin/94716398388365841/", "board_name": "Outdoor Fun", "like_count": "14", "repin_count": "94"}
That is exactly the output I want, but I only get it from one pin, not from all the pins on the page. I've spent a lot of time reading similar questions but couldn't find anything quite like this. Any ideas what the problem is? Thanks in advance!
EDIT: Oh, I'm guessing it's because of the [0] before the strip function? Sorry, I just realized that's probably the problem...
EDIT: Well, that wasn't it. I'm pretty sure it has to do with the strip function, but I can't seem to use it correctly and still get more than one pin in the output. Could the solution be part of this question?: Scrapy: Why are extracted strings in this format? https://stackoverflow.com/questions/17000640/scrapy-why-extracted-strings-are-in-this-format I see some overlap, but I don't know how to make use of it.
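On the strip question itself: `strip()` works on a single string, so if more than one match is kept it has to be applied to each element rather than to `extract()[0]` alone. A minimal sketch, with made-up values shaped like the ones in the linked question:

```python
# Made-up raw values shaped like what extract() can return
raw = [u'\n        Outdoor Fun\n    ', u'\n        Take Me Fishing\n    ']

# strip() each element instead of only the first one
cleaned = [s.strip() for s in raw]
print(cleaned)  # ['Outdoor Fun', 'Take Me Fishing']
```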
EDIT: OK, when I modify the spider like this:
from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem

class PinterestSpider(Spider):
    name = "pinterest"
    allowed_domains = ["pinterest.com"]
    start_urls = ["http://www.pinterest.com/llbean/pins/"]

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath("//div[@class='pinWrapper']")
        items = []
        for site in sites:
            item = PinterestItem()
            item["pin_link"] = site.select("//div[@class='pinHolder']/a/@href").extract()[0].strip()
            item["repin_count"] = site.select("//em[@class='socialMetaCount repinCountSmall']/text()").extract()[0].strip()
            item["like_count"] = site.select("//em[@class='socialMetaCount likeCountSmall']/text()").extract()[0].strip()
            item["board_name"] = site.select("//div[@class='creditTitle']/text()").extract()[0].strip()
            items.append(item)
        return items
it does give me several lines of output, but they obviously all contain the same information. So it scrapes as many items as there are pins on the page, but all with the same output:
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
{"pin_link": "/pin/94716398388371133/", "board_name": "Take Me Fishing", "like_count": "3", "repin_count": "21"}
etc.
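One guess at why every row repeats: an XPath that begins with `//` searches the whole document even when it is called on a sub-selection, so each `site.select("//...")` call above matches the first pin on the page again and again. A path relative to the node (`.//` in Scrapy's selectors) restricts the search to that node. A standard-library sketch of the difference, with placeholder markup:

```python
import xml.etree.ElementTree as ET

# Placeholder page with two pins
page = ET.fromstring(
    "<page>"
    "<pin><a href='/pin/1'/></pin>"
    "<pin><a href='/pin/2'/></pin>"
    "</page>"
)

# Searching the whole document inside the loop: always the same first match
repeated = [page.find(".//a").get("href") for _ in page.findall("pin")]
print(repeated)  # ['/pin/1', '/pin/1']

# Searching relative to each pin node: one distinct link per pin
distinct = [pin.find("a").get("href") for pin in page.findall("pin")]
print(distinct)  # ['/pin/1', '/pin/2']
```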