Scrapy: 'str' object has no attribute 'iter'

2024-01-24

I added restrict_xpaths rules to my Scrapy spider, and now it fails immediately:

2015-03-16 15:46:53+0000 [tsr] ERROR: Spider error processing <GET http://www.thestudentroom.co.uk/forumdisplay.php?f=143>
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 602, in _tick
        taskObj._oneWorkUnit()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 479, in _oneWorkUnit
        result = self._iterator.next()
      File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
        yield next(it)
      File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
        for x in result:
      File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Library/Python/2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 73, in _parse_response
        for request_or_item in self._requests_to_follow(response):
      File "/Library/Python/2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 52, in _requests_to_follow
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
      File "/Library/Python/2.7/site-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 107, in extract_links
        links = self._extract_links(doc, response.url, response.encoding, base_url)
      File "/Library/Python/2.7/site-packages/scrapy/linkextractor.py", line 94, in _extract_links
        return self.link_extractor._extract_links(*args, **kwargs)
      File "/Library/Python/2.7/site-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 50, in _extract_links
        for el, attr, attr_val in self._iter_links(selector._root):
      File "/Library/Python/2.7/site-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 38, in _iter_links
        for el in document.iter(etree.Element):
    exceptions.AttributeError: 'str' object has no attribute 'iter'

I don't understand why this error is happening.

Here is my short spider:

import scrapy

from tutorial.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class TsrSpider(CrawlSpider):
    name = 'tsr'
    allowed_domains = ['thestudentroom.co.uk']
    start_urls = ['http://www.thestudentroom.co.uk/forumdisplay.php?f=143']

    download_delay = 4
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:35.0) Gecko/20100101 Firefox/35.0'

    rules = (
        Rule(
            LinkExtractor(
                allow=('forumdisplay\.php\?f=143\&page=\d',),
                restrict_xpaths=("//li[@class='pager-page_numbers']/a/@href",))),

        Rule(
            LinkExtractor(
                allow=('showthread\.php\?t=\d+\&page=\d+',),
                restrict_xpaths=("//li[@class='pager-page_numbers']/a/@href",)), 
            callback='parse_link'),

        Rule(
            LinkExtractor(
                allow=('showthread\.php\?t=\d+',),
                restrict_xpaths=("//tr[@class='thread  unread    ']",)),
            callback='parse_link'),
        )

    def parse_link(self, response):
        # Iterate over posts.
        for sel in response.xpath("//li[@class='post threadpost old   ']"):
            rating = sel.xpath(
                "div[@class='post-footer']//span[@class='score']/text()").extract()
            if not rating:
                rating = 0
            else:
                rating = rating[0]
            item = DmozItem()
            item['post'] = sel.xpath(
                "div[@class='post-content']/blockquote[@class='postcontent restore']/text()").extract()
            item['link'] = response.url
            item['topic'] = response.xpath(
                "//div[@class='forum-header section-header']/h1/span/text()").extract()
            item['rating'] = rating
            yield item

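Source: http://pastebin.com/YXdWvPgX

Can anyone help me? Where is the error? I've been searching for days!

The problem is that restrict_xpaths should point to elements, either the links themselves or containers that hold links, not to attributes: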

rules = [
    # Raw strings keep the regex backslashes intact without escape warnings.
    Rule(LinkExtractor(allow=r'forumdisplay\.php\?f=143\&page=\d',
                       restrict_xpaths="//li[@class='pager-page_numbers']/a")),

    Rule(LinkExtractor(allow=r'showthread\.php\?t=\d+\&page=\d+',
                       restrict_xpaths="//li[@class='pager-page_numbers']/a"),
         callback='parse_link'),

    Rule(LinkExtractor(allow=r'showthread\.php\?t=\d+',
                       restrict_xpaths="//tr[@class='thread  unread    ']"),
         callback='parse_link'),
]

Tested (works for me).
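
To see why an XPath ending in /@href triggers this exact error, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (which Scrapy actually uses), and a made-up pager snippet rather than the real page. ElementTree cannot select attributes with /@href, so the attribute result is emulated with .get("href"); the helper `walk` is hypothetical and only imitates the `document.iter(...)` call visible in the traceback.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified pager markup (not the real forum page).
PAGE = """
<ul>
  <li class="pager-page_numbers"><a href="forumdisplay.php?f=143&amp;page=2">2</a></li>
  <li class="pager-page_numbers"><a href="forumdisplay.php?f=143&amp;page=3">3</a></li>
</ul>
"""

root = ET.fromstring(PAGE)

# Selecting elements: every result is a node that supports .iter().
link_elements = root.findall(".//li[@class='pager-page_numbers']/a")

# Stand-in for what an XPath ending in /@href yields: plain strings.
href_strings = [a.get("href") for a in link_elements]


def walk(node):
    """Imitate the extractor's loop from the traceback, which calls node.iter()."""
    return [el.tag for el in node.iter()]


# An element can be walked.
element_result = walk(link_elements[0])

# A string cannot: it has no .iter() method.
try:
    walk(href_strings[0])
    string_error = None
except AttributeError as exc:
    string_error = str(exc)

print(element_result)
print(string_error)
```

Under these assumptions, the element is walked normally while the string raises the same AttributeError as in the question.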

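FYI, Scrapy defines restrict_xpaths (http://doc.scrapy.org/en/0.22/topics/link-extractors.html#scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor) as "expressions pointing to regions":

restrict_xpaths (str or list) – is an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.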
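
The "region" idea can be sketched roughly as follows. This is an illustration only, not Scrapy's implementation: it uses the standard library's xml.etree.ElementTree instead of lxml, the markup is invented, and `links_in_regions` is a hypothetical helper. It shows that a region may be either the link element itself or a container around it, and that links outside the selected regions are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one pager link plus one unrelated link.
PAGE = """
<div>
  <ul id="pager">
    <li class="pager-page_numbers"><a href="?f=143&amp;page=2">2</a></li>
  </ul>
  <p><a href="/unrelated">elsewhere</a></p>
</div>
"""


def links_in_regions(root, region_path):
    """Collect the href of every <a> found inside each selected region."""
    hrefs = []
    for region in root.findall(region_path):
        # iter("a") visits the region element itself and all its descendants.
        for a in region.iter("a"):
            hrefs.append(a.get("href"))
    return hrefs


root = ET.fromstring(PAGE)

# Region given as a container element.
from_container = links_in_regions(root, ".//li[@class='pager-page_numbers']")

# Region given as the link element itself.
from_link = links_in_regions(root, ".//li[@class='pager-page_numbers']/a")

print(from_container)
print(from_link)
```

Both calls find the same pager link and skip the unrelated one, which is why the answer's XPaths can stop at the a element (or at a container) instead of going on to @href.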
