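I wrote a small wrapper around BeautifulSoup, the great HTML parser.
Recently I tried to improve the code and make all the BeautifulSoup methods available directly on the wrapper class (instead of through a class attribute), and I thought subclassing the BeautifulSoup parser would be the best way to achieve this.
Here is the class: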
class ScrapeInputError(Exception): pass

from BeautifulSoup import BeautifulSoup

class Scrape(BeautifulSoup):
    """base class to be subclassed
    basically a subclassed BeautifulSoup wrapper that provides
    basic url fetching with urllib2
    and the basic html parsing with beautifulsoup
    and some basic cleaning of head, scripts etc."""
    def __init__(self, file):
        self._file = file
        # very basic input validation
        import re
        if not re.search(r"^http://", self._file):
            raise ScrapeInputError("please enter a url that starts with http://")
        import urllib2
        # from BeautifulSoup import BeautifulSoup
        self._page = urllib2.urlopen(self._file)  # fetching the page
        BeautifulSoup.__init__(self, self._page)
        # self._soup = BeautifulSoup(self._page)  # calling the html parser
This way I can instantiate the class:
x = Scrape("http://someurl.com")
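and then traverse the tree with x.elem or x.find.
This works well for some BeautifulSoup methods (see above) but fails for others, namely those that use iteration, such as "for e in x:".
The error message: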
Traceback (most recent call last):
  File "<pyshell#86>", line 2, in <module>
    print e
  File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
    value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
    seq = self.asynccall(oid, methodname, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
    self.putmessage((seq, request))
  File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
    s = pickle.dumps(message)
  File "C:\Python27\lib\copy_reg.py", line 77, in _reduce_ex
    raise TypeError("a class that defines __slots__ without "
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled
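I researched the error message but couldn't find anything I could work with. I don't want to play with BeautifulSoup's internal implementation (and honestly, I don't know or understand __slots__ or __getstate__); I just want to use the functionality.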
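For context, the TypeError at the end of the traceback can be reproduced without BeautifulSoup or IDLE. The following is only a minimal sketch: Slotted, SlottedWithState, and try_pickle are hypothetical names made up for illustration, and it assumes pickle protocol 0, which is what a bare pickle.dumps(message) uses on Python 2. It does not show where BeautifulSoup itself triggers the error.

```python
import pickle


class Slotted(object):
    """Hypothetical class: defines __slots__ but no __getstate__."""
    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value


class SlottedWithState(object):
    """Same data, plus explicit hooks that tell pickle what to save."""
    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value

    def __getstate__(self):
        # Hand pickle a plain dict describing the slot contents.
        return {"value": self.value}

    def __setstate__(self, state):
        # Restore the slot contents when unpickling.
        self.value = state["value"]


def try_pickle(obj):
    """Return the pickled data, or the TypeError raised under protocol 0."""
    try:
        return pickle.dumps(obj, protocol=0)
    except TypeError as exc:
        return exc


failed = try_pickle(Slotted(1))           # a TypeError instance
worked = try_pickle(SlottedWithState(2))  # pickled data
```

Under these assumptions, the first call fails in the same way as the traceback, while the second succeeds because the class says explicitly what its state is.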
Instead of subclassing, I also tried returning a BeautifulSoup object from the class's __init__, but the __init__ method returns None.
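The None result is by design: Python requires __init__ to return None, so it cannot hand back a different object. One possible alternative, sketched here under stated assumptions, is a plain factory function that validates the URL and returns the parser object directly. FakeSoup, scrape, and the fetch parameter are hypothetical names used so the sketch runs without BeautifulSoup or a network connection; with the original code, parser would be BeautifulSoup and fetch would wrap urllib2.urlopen.

```python
class ScrapeInputError(Exception):
    pass


class FakeSoup(object):
    """Hypothetical stand-in for the BeautifulSoup class."""

    def __init__(self, markup):
        self.markup = markup


def scrape(url, fetch, parser=FakeSoup):
    """Validate the URL, fetch the page, and return a parser object.

    `fetch` is any callable that takes a URL and returns markup.
    """
    if not url.startswith("http://"):
        raise ScrapeInputError("please enter a url that starts with http://")
    # Unlike __init__, an ordinary function may return any object it likes.
    return parser(fetch(url))


# Usage with a fake fetcher that returns a fixed page.
soup = scrape("http://someurl.com", fetch=lambda url: "<html></html>")
```

The caller gets the parser object itself, so every parser method is available directly, without subclassing.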
I'd be glad for any help here.