requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied

2024-04-07

I am working on a web scraping project and have run into the following error.

requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

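Below is my code. I retrieve all the links from the HTML table, and they print out as expected. But when I try to loop through them (the links) with requests.get, I get the error above.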

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        table = []
        # Find all the divs we need in one go.
        divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
        for div in divs:
            # find all the enclosing a tags.
            anchors = div.find_all('a')
            for anchor in anchors:
                # Now we have groups of 3 list items (li) tags
                lis = anchor.find_all('li')
                # we clean up the text from the group of 3 li tags and add them as a list to our table list.
                table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
        # We have all the data so we add it to a DataFrame.
        headers = ['Number', 'Tenant', 'Square Footage']
        df = DataFrame(table, columns=headers)
        print (df)

Your mistake is the second for loop in the code:

for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:

ref['href'] gives you a single URL (a string), but you use it as if it were a list in the next for loop.

So in effect you have

for link in ref['href']:

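which iterates over the string character by character. The first value of link is therefore the first character of the URL http://properties.kimcore..., which is h, and requests.get('h') raises MissingSchema.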
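A minimal sketch of this behaviour, with no network access needed. The URL here is a hypothetical placeholder standing in for one value of ref['href'], not a link from the actual page:

```python
# Hypothetical URL standing in for a single ref['href'] value.
href = "http://example.com/property/1"

# A for loop over a string yields one character per iteration,
# so each `link` is a single character, not a URL.
chars = [link for link in href]

first = chars[0]
print(first)  # this is the value that would be passed to requests.get
```

The first item is the single character that appears in the error message, which is why requests complains about the URL 'h'.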

Full working code:

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    link = ref['href']
    print(link)
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = []
    # Find all the divs we need in one go.
    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        # find all the enclosing a tags.
        anchors = div.find_all('a')
        for anchor in anchors:
            # Now we have groups of 3 list items (li) tags
            lis = anchor.find_all('li')
            # we clean up the text from the group of 3 li tags and add them as a list to our table list.
            table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
    # We have all the data so we add it to a DataFrame.
    headers = ['Number', 'Tenant', 'Square Footage']
    df = DataFrame(table, columns=headers)
    print (df)

BTW: if you add a comma inside the parentheses, as in (ref['href'], ), you get a tuple, and then the second for loop works correctly.
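The difference between the two spellings can be checked in isolation. This is only an illustrative sketch; the URL is again a hypothetical placeholder:

```python
# Hypothetical URL standing in for a single ref['href'] value.
href = "http://example.com/property/1"

not_a_tuple = (href)   # parentheses alone do nothing: still a str
one_tuple = (href,)    # the trailing comma makes a 1-element tuple

# Looping over the tuple yields the whole URL exactly once.
links_seen = [link for link in one_tuple]
print(type(not_a_tuple).__name__, type(one_tuple).__name__, links_seen)
```

Wrapping a single URL in a tuple only to loop over it once is rarely worth it, though; dropping the inner loop, as in the working code above, is simpler.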

EDIT: The code now creates the list table_data at the start and adds all the data to this list. At the end it converts the list into a DataFrame.

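But now I see that it reads the same page multiple times, because every column in a row contains the same URL. You have to take the URL from only one column.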
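If the links have already been collected into a flat list, another way to avoid repeated requests is to drop duplicates while keeping the original order. This is a sketch of an alternative technique, not what the answer's code does, and the hrefs below are made-up placeholders:

```python
# Hypothetical hrefs: each table row repeats the same URL in every column.
hrefs = [
    "http://example.com/property/1",
    "http://example.com/property/1",
    "http://example.com/property/2",
    "http://example.com/property/2",
    "http://example.com/property/1",
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes duplicates without reordering the URLs.
unique_hrefs = list(dict.fromkeys(hrefs))
print(unique_hrefs)
```

Each URL in unique_hrefs could then be passed to requests.get exactly once.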

EDIT: Now it doesn't read the same URL multiple times.

EDIT: Now it gets the text and href from the first link in each row and adds them to every element appended to the list with append().

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table_data = []

# all rows in table except first ([1:]) - headers
rows = soup.select('table tr')[1:]
for row in rows: 

    # link in first column (td[0])
    #link = row.select('td')[0].find('a')
    link = row.find('a')

    link_href = link['href']
    link_text = link.text

    print('text:', link_text)
    print('href:', link_href)

    page = requests.get(link_href)
    soup = BeautifulSoup(page.content, 'html.parser')

    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        anchors = div.find_all('a')
        for anchor in anchors:
            lis = anchor.find_all('li')
            item1 = unicodedata.normalize("NFKD", lis[0].text).strip()
            item2 = lis[1].text
            item3 = lis[2].text.strip()
            table_data.append([item1, item2, item3, link_text, link_href])

    print('table_data size:', len(table_data))            

headers = ['Number', 'Tenant', 'Square Footage', 'Link Text', 'Link Href']
df = DataFrame(table_data, columns=headers)
print(df)


