At the end of the day, all of the other modules (feedparser, mechanize, and urllib2) call httplib, which is where the exception is being thrown.
First things first: I also downloaded this feed with wget, and the resulting file was 1854 bytes. Next, I tried urllib2:
>>> import urllib2
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> f = urllib2.urlopen(url)
>>> f.headers.headers
['Cache-Control: private\r\n',
'Content-Type: text/xml; charset=utf-8\r\n',
'Server: Microsoft-IIS/7.5\r\n',
'X-AspNet-Version: 4.0.30319\r\n',
'X-Powered-By: ASP.NET\r\n',
'Date: Mon, 07 Jan 2013 23:21:51 GMT\r\n',
'Via: 1.1 BC1-ACLD\r\n',
'Transfer-Encoding: chunked\r\n',
'Connection: close\r\n']
>>> f.read()
< Full traceback cut >
IncompleteRead: IncompleteRead(1854 bytes read)
So it reads all 1854 bytes but then thinks there is more to come. If we explicitly tell it to read only 1854 bytes, it works:
>>> f = urllib2.urlopen(url)
>>> f.read(1854)
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
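Obviously, this is only useful if we always know the exact length ahead of time. Instead, we can use the fact that the partial read is returned as an attribute on the exception to capture the entire contents: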
>>> import httplib
>>> f = urllib2.urlopen(url)
>>> try:
...     contents = f.read()
... except httplib.IncompleteRead as e:
...     contents = e.partial
...
>>> contents
'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">...snip...</rss>'
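If the same try/except ends up in more than one place, it can be wrapped in a small helper. This is only a sketch: `read_tolerant` is a name I made up (it is not part of httplib), and it assumes the object passed in has a `read()` method like the one returned by `urllib2.urlopen`. It also returns a flag so the caller can tell a truncated body from a complete one, which the bare try/except above loses.

```python
try:
    import httplib  # Python 2, as used in this answer
except ImportError:
    import http.client as httplib  # Python 3 name for the same module


def read_tolerant(response):
    """Return (body, complete) for any object with a read() method.

    read_tolerant is a hypothetical helper name, not an httplib API.
    """
    try:
        return response.read(), True
    except httplib.IncompleteRead as e:
        # e.partial holds whatever was received before the read gave up.
        return e.partial, False
```

With the feed above this would be used as `contents, complete = read_tolerant(urllib2.urlopen(url))`, and `complete` would presumably come back as `False`.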
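This blog post http://bobrochel.blogspot.co.nz/2010/11/bad-servers-chunked-encoding-and.html suggests this is the server's fault, and describes how to monkey-patch the httplib.HTTPResponse.read() method with the try..except block above so that it is handled behind the scenes: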
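import httplib

def patch_http_response_read(func):
    # Wrap HTTPResponse.read so a truncated body is returned instead of raising.
    def inner(*args):
        try:
            return func(*args)
        except httplib.IncompleteRead as e:
            # e.partial holds the bytes received before the read failed.
            return e.partial
    return inner

httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)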
I applied the patch and then feedparser worked:
>>> import feedparser
>>> url = 'http://hattiesburg.legistar.com/Feed.ashx?M=Calendar&ID=543375&GUID=83d4a09c-6b40-4300-a04b-f88884048d49&Mode=2013&Title=City+of+Hattiesburg%2c+MS+-+Calendar+(2013)'
>>> feedparser.parse(url)
{'bozo': 0,
'encoding': 'utf-8',
'entries': ...
'status': 200,
'version': 'rss20'}
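One thing to keep in mind is that assigning to httplib.HTTPResponse.read changes it for every HTTP call in the process. A narrower variant, sketched below, applies the same wrapper only inside a `with` block and restores the original method afterwards. The name `tolerate_incomplete_reads` is hypothetical, and the sketch is not thread-safe: other threads would see the patched method while the block is active.

```python
import contextlib

try:
    import httplib  # Python 2, as used in this answer
except ImportError:
    import http.client as httplib  # Python 3 name for the same module


@contextlib.contextmanager
def tolerate_incomplete_reads():
    """Temporarily make HTTPResponse.read return partial data.

    Hypothetical helper; same idea as the monkey-patch, but scoped.
    """
    original = httplib.HTTPResponse.read

    def patched(self, *args, **kwargs):
        try:
            return original(self, *args, **kwargs)
        except httplib.IncompleteRead as e:
            return e.partial

    httplib.HTTPResponse.read = patched
    try:
        yield
    finally:
        # Always put the original method back, even if the body raised.
        httplib.HTTPResponse.read = original
```

Usage would look like `with tolerate_incomplete_reads(): feed = feedparser.parse(url)`, assuming feedparser does its fetching through httplib inside the block.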
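Monkey-patching httplib isn't the nicest way of doing things, but it seems to work. I'm not expert enough in the HTTP protocols to say for sure whether the server is doing things wrong, or whether httplib is mishandling an edge case.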
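For anyone who wants to poke at this without hitting the live server, here is a rough offline reproduction. It rests on an assumption I have not verified against this server: that the chunked body simply ends without the terminating zero-length chunk. `FakeSocket`, `TRUNCATED` and `read_truncated` are made-up names; the only httplib pieces used are `HTTPResponse`, `begin()`, `read()` and `IncompleteRead`.

```python
import io

try:
    import httplib  # Python 2, as used in this answer
except ImportError:
    import http.client as httplib  # Python 3 name for the same module


class FakeSocket(object):
    """Hypothetical stand-in for a socket that replays canned bytes."""

    def __init__(self, raw):
        self._raw = raw

    def makefile(self, *args, **kwargs):
        # HTTPResponse only needs a file-like object to read from.
        return io.BytesIO(self._raw)


# One 5-byte chunk, then the stream ends with no final "0\r\n\r\n" chunk.
TRUNCATED = (b"HTTP/1.1 200 OK\r\n"
             b"Transfer-Encoding: chunked\r\n"
             b"\r\n"
             b"5\r\nhello\r\n")


def read_truncated():
    """Return (data, error); error is None if the read completed."""
    response = httplib.HTTPResponse(FakeSocket(TRUNCATED))
    response.begin()  # parse the status line and headers
    try:
        return response.read(), None
    except httplib.IncompleteRead as e:
        return e.partial, e
```

If the assumption holds, `read_truncated()` gives back the full payload together with an IncompleteRead, which matches the symptom above: every byte arrives, but httplib is still waiting for the end-of-body marker.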