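I'd like to know what difference it makes when you pass headers to requests.get, i.e. the difference between requests.get(url, headers=headers) and requests.get(url).
I have these two pieces of code: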
from lxml import html
from lxml import etree
import requests
import re
url = "http://www.amazon.in/SanDisk-micro-USB-connector-OTG-enabled-Android/dp/B00RBGYGMO"
page = requests.get(url)
tree = html.fromstring(page.text)
XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@src'
image_source = tree.xpath(XPATH_IMAGE_SOURCE)
print('type:', type(image_source[0]))
print(image_source[0])
As you would expect, its output is a URL. But this one:
from lxml import html
from lxml import etree
import requests
import re
url = "http://www.amazon.in/SanDisk-micro-USB-connector-OTG-enabled-Android/dp/B00RBGYGMO"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
page = requests.get(url, headers=headers)
tree = html.fromstring(page.text)
XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@src'
image_source = tree.xpath(XPATH_IMAGE_SOURCE)
print('type:', type(image_source[0]))
print(image_source[0])
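produces output that starts with data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAoHBwgHBgoIC
I guess this is the actual image, unrendered, just raw data. Any idea how to get it in URL form instead? In what other ways does the presence of headers affect the response we get?
Thanks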
Save the response from the first snippet to an HTML file and open it in your browser.
As you will see, Amazon blocks the request when it is sent without browser-like headers.
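To see why the two requests are treated differently, it can help to look at the headers requests would actually send. The following is a minimal sketch and not part of the original answer: effective_headers and save_html are hypothetical helper names, and the file name in the trailing comment is an assumption. The request is only prepared, never sent, so the comparison itself needs no network access.

```python
import requests

URL = "http://www.amazon.in/SanDisk-micro-USB-connector-OTG-enabled-Android/dp/B00RBGYGMO"
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
}


def effective_headers(url, headers=None):
    """Return the headers requests would send for a GET, without sending it."""
    session = requests.Session()
    # prepare_request merges the session's default headers with our own.
    prepared = session.prepare_request(requests.Request("GET", url, headers=headers))
    return dict(prepared.headers)


def save_html(text, path):
    """Write response text to a file so it can be opened in a browser."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)


# Without custom headers, requests identifies itself as python-requests/<version>.
print(effective_headers(URL)["User-Agent"])
# With custom headers, the browser-like User-Agent replaces the default.
print(effective_headers(URL, BROWSER_HEADERS)["User-Agent"])

# To inspect what the server really returned (requires network access):
# page = requests.get(URL, timeout=10)
# save_html(page.text, "no_headers.html")
```

The default User-Agent is probably what the server reacts to here, but that is an inference from the observed behaviour, not something the site documents.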
Use this XPath instead:
XPATH_IMAGE_SOURCE = '//*[@id="main-image-container"]//img/@data-old-hires'
Output:
type: <class 'lxml.etree._ElementStringResult'>
http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg
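If the page layout varies, one option is to prefer data-old-hires and fall back to src only when src holds a real URL. The sketch below is an illustration under assumptions: pick_image_url is a hypothetical helper, and SAMPLE is a hand-written stand-in for the page with a shortened placeholder in place of the real base64 payload.

```python
from lxml import html

# Hand-written stand-in for the product page; the base64 payload is a placeholder.
SAMPLE = """
<html><body>
<div id="main-image-container">
  <img alt=".." src=" data:image/webp;base64,UklGRg=="
       data-old-hires="http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg">
</div>
</body></html>
"""


def pick_image_url(tree):
    """Return the first usable image URL inside #main-image-container, or None."""
    for img in tree.xpath('//*[@id="main-image-container"]//img'):
        hires = (img.get("data-old-hires") or "").strip()
        if hires:
            return hires
        src = (img.get("src") or "").strip()
        # Skip inline images: a data: URI is the image itself, not a link to it.
        if src and not src.startswith("data:"):
            return src
    return None


print(pick_image_url(html.fromstring(SAMPLE)))
# http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg
```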
This is the raw HTML:
<img alt=".." src=" data:image/webp;base64,UklGRuYIAABXRUJQVlA4INoIAACQQQCdASosAcsAPrFWpEqkIqQhIxN6gIgWCek6r4bUf/..."
data-old-hires="http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg"
The image URL is in the data-old-hires attribute.
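The src value, by contrast, is a data: URI, meaning the image bytes are embedded in the page as base64 text. As a rough sketch of how such a value could be recognised and decoded (parse_data_uri is a hypothetical helper, and the sample payload is a shortened placeholder, not the real image):

```python
import base64


def parse_data_uri(value):
    """Split a base64 data: URI into (media_type, raw_bytes); return None otherwise."""
    value = value.strip()
    if not value.startswith("data:"):
        return None  # an ordinary URL, nothing to decode
    header, _, payload = value[len("data:"):].partition(",")
    if not header.endswith(";base64"):
        return None  # non-base64 data URIs are out of scope for this sketch
    media_type = header[: -len(";base64")] or "text/plain"
    return media_type, base64.b64decode(payload)


print(parse_data_uri("data:image/webp;base64,UklGRg=="))
# ('image/webp', b'RIFF')
print(parse_data_uri("http://ecx.images-amazon.com/images/I/617TjMIouyL._SL1274_.jpg"))
# None
```

Decoding yields the image content but not a link to it, which is why reading data-old-hires is the simpler route when a URL is what you need.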