I improved my code following this suggestion from @paultrmbrth. What I need is to scrape data from pages similar to this one and this one, and I want the CSV output to look like the picture below.
But my code's CSV output is a little messy, like this:
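I have two questions. First, is there any way to make the CSV output look like the first picture? Second, I also want the movie title to be scraped; please give me a hint or some code that I can use to scrape the movie title along with the contents.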
UPDATE
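This question has been solved perfectly by Tarun Lalwani. But now the CSV file's header only contains the categories of the first scraped URL. For example, when I try to scrape this webpage, which has the References, Referenced in, Features, Featured in and Spoofed in categories,
and this webpage, which has the Follows, Followed by, Edited from, Edited into, Spin-off, References, Referenced in, Features, Featured in, Spoofs and Spoofed in categories,
the header of the CSV output file contains only the first webpage's categories, i.e. References, Referenced in, Features, Featured in and Spoofed in.
So some categories from the second webpage, such as Follows, Followed by, Edited from, Edited into and Spoofs,
do not appear in the header of the output CSV file, and neither does their content.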
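To illustrate the behaviour outside Scrapy, here is a minimal standard-library sketch. It assumes the header is fixed from the keys of the first item only, which is what I seem to be observing; `write_rows` and the sample items are made up for illustration and are not part of my project.

```python
import csv
import io


def write_rows(items):
    """Write dict items to CSV, taking the header from the first item only."""
    buf = io.StringIO()
    writer = None
    for item in items:
        if writer is None:
            # The header is decided here, once, from the first item's keys.
            writer = csv.DictWriter(
                buf,
                fieldnames=list(item.keys()),
                extrasaction='ignore',  # silently drop keys not in the header
                lineterminator='\n',
            )
            writer.writeheader()
        writer.writerow(item)
    return buf.getvalue()


# Hypothetical items standing in for the two scraped pages.
first = {'Title': 'A', 'References': 'x', 'Spoofed in': 'y'}
second = {'Title': 'B', 'Follows': 'z', 'References': 'w'}

output = write_rows([first, second])
print(output)
```

In this sketch the `Follows` column of the second item never reaches the file, which mirrors what happens to the second page's extra categories in my output.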
This is the code I am using:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["imdb.com"]
    start_urls = (
        'http://www.imdb.com/title/tt0093777/trivia?tab=mc&ref_=tt_trv_cnn',
        'http://www.imdb.com/title/tt0096874/trivia?tab=mc&ref_=tt_trv_cnn',
    )

    def parse(self, response):
        item = {}
        for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
            item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()
            key = h4.xpath('normalize-space()').get().strip()
            if key in ['Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                       'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                       'Features']:
                values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]', cnt=cnt).xpath(
                    'string(.//a)').getall(),
                item[key] = values
        yield item
This is the exporters.py file:
try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2 fallback

from scrapy.exporters import CsvItemExporter
from scrapy.conf import settings


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        # The header row is written once, based on the first exported item.
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        # Normalise every cell to a sequence: unwrap a 1-tuple that holds a
        # list, and wrap scalar values in a 1-tuple.
        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        # Transpose the cells so that each value gets its own CSV row.
        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            # Python 2 only: `unicode` does not exist in Python 3.
            self.csv_writer.writerow([unicode(s).encode("utf-8") for s in row])
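For context, the row expansion in the exporter relies on `zip_longest`: every cell holds a sequence, and the sequences are transposed into as many CSV rows as the longest one has entries, with shorter columns padded by empty strings. Below is a standalone sketch of just that step; the cell values are made up for illustration.

```python
from itertools import zip_longest

# One sequence per CSV column: a title, three references, one feature.
values = [
    ('Movie A',),
    ['ref 1', 'ref 2', 'ref 3'],
    ['feat 1'],
]

# Transpose columns into rows, padding the shorter columns with ''.
rows = list(zip_longest(*values, fillvalue=''))

for row in rows:
    print(row)
```

The title therefore only appears on the first row of each item, and the remaining rows carry the extra values of the longer columns.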
What I want to achieve is to have all of these categories in the header of the CSV output:
'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from', 'Features'
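One direction I am considering, as an untested sketch: declare the full column list up front instead of letting the first item decide it. Scrapy has a `FEED_EXPORT_FIELDS` setting for this, but whether my custom exporter honours it is an assumption I have not verified. The `write_rows` helper and the sample items below are hypothetical and use only the standard library to show the intended effect.

```python
import csv
import io

# The category names checked in my spider.
CATEGORIES = [
    'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off',
    'Referenced in', 'Featured in', 'Spoofed in', 'References', 'Spoofs',
    'Version of', 'Remade as', 'Edited from', 'Features',
]

# In a Scrapy project this list would live in settings.py (assumption:
# the custom exporter picks it up as its fields to export).
FEED_EXPORT_FIELDS = ['Title'] + CATEGORIES


def write_rows(items, fieldnames):
    """Stand-in for the exporter: write items under a fixed header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval='',
                            lineterminator='\n')
    writer.writeheader()
    for item in items:
        writer.writerow(item)  # missing categories become empty cells
    return buf.getvalue()


# Hypothetical items with different category sets.
items = [
    {'Title': 'A', 'References': 'x'},
    {'Title': 'B', 'Follows': 'z'},
]

output = write_rows(items, FEED_EXPORT_FIELDS)
header = output.splitlines()[0].split(',')
print(header)
```

With a fixed list like this, every category gets a column even when the first scraped page does not contain it.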
Any help would be appreciated.