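Complete newbie here trying to scrape the table on this page; the furthest I have got is loading the rvest package. My problems are:
- I can't find the right element. The one I picked out with the inspector is "table.w782.comm.lsjz", but it returns a list of length 0, and adding %>% .[[1]] after html_table(), i.e.
fund_page %>% html_nodes("table.w782.comm.lsjz") %>% html_table() %>% .[[1]]
doesn't work either
(Error in .[[1]] : subscript out of bounds)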
fund_link <- "https://fundf10.eastmoney.com/jjjz_510300.html"
fund_page <- read_html(fund_link)
fund_table <- fund_page %>% html_nodes("table.w782.comm.lsjz") %>% html_table()
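- The table has multiple pages (113), but clicking page 2 does not reload the HTML, so I don't know how to scrape all 113 pages of data into a single table...
I would really appreciate any pointers on what I can do...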
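In your original question, the problem was that the table is delivered inside a script element rather than as a valid XML/HTML table. Using the API link you found is definitely the right way to go.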
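To see why the selector comes back empty, here is a small offline sketch. The HTML string is a made-up stand-in for the real page (the id "tpl" and the one-cell table are hypothetical); it only illustrates that markup placed inside a script element is kept as plain text by the parser, so a "table" selector has nothing to match.

```r
library(rvest)

# Hypothetical stand-in for the fund page: the table lives inside a
# <script type="text/html"> template, as on the real page.
demo <- read_html(
  '<html><body><script id="tpl" type="text/html"><table class="lsjz"><tr><td>1</td></tr></table></script></body></html>'
)

# The parser treats the script body as raw text, so no table node exists.
n_tables <- length(html_nodes(demo, "table"))

# The script element itself is found normally.
n_scripts <- length(html_nodes(demo, "script#tpl"))

# Re-parsing the script's text as HTML exposes the table markup again.
inner <- demo %>% html_nodes("script#tpl") %>% html_text() %>% read_html()
n_inner <- length(html_nodes(inner, "table"))
```

On the real page the script body is a template with {{...}} placeholders rather than finished rows, which is why the API link is the more practical source.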
library(rvest)
# You gave an API link and this is the best option for getting the data.
fund_link <- "https://fundf10.eastmoney.com/F10DataApi.aspx?type=lsjz&code=510300&page=1&sdate=2019-01-01&edate=2021-02-13&per=40"
fund_page <- read_html(fund_link)
# Any of these seem to work
fund_table <- fund_page %>% html_nodes(css = "table") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.w782") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.comm") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.lsjz") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.w782.comm.lsjz") %>% html_table() %>% .[[1]]
# Original Question:
fund_link <- "https://fundf10.eastmoney.com/jjjz_510300.html"
fund_page <- read_html(fund_link)
# The following doesn't work because the table you want is actually a script, not a table.
# <script id="lsjzTable" type="text/html">
# {{if Data && Data.LSJZList}}
# <table class="w782 comm lsjz">
# <thead>
# <tr>
# <th class="first">净值日期</th>
# {{if ((Data.FundType!="004" && Data.FundType!="005") || "510300"=="511880")}}
# <th>单位净值</th>
# <th>累计净值</th>
# {{if Data.FundType=="100"}}
# <th>周增长率</th>
# {{else}}
# <th>日增长率<img id="jjjzTip" style="position: relative; top: 3px; left: 3px;" data-html="true" data-placement="bottom" title="日增长率为空原因如下:<br>1、非交易日净值不参与日增长率计算(灰色数据行)。<br>2、上一交易日净值未披露,日增长率无法计算。" src="//j5.dfcfw.com/image/201307/20130708102440.gif"></th>
# {{/if}}
fund_table <- fund_page %>% html_nodes(css = "table") %>% html_table() %>% .[[1]]
# The following is a partial solution but isn't fully working.
fund_table <- fund_page %>%
  html_nodes("script#lsjzTable") %>%
  as.character(.) %>%
  stringr::str_remove_all("\\{\\{.+?\\}\\}") %>%
  stringr::str_remove_all("\\<\\/?script.*?\\>") %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()
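On the second part of the question (collecting every page), the API link above carries a page parameter, so one possible approach is to request the pages one by one and stack the results. The sketch below is only that, a sketch: build_fund_url and scrape_fund_pages are hypothetical helper names, the query parameters are copied from the link above without further knowledge of the API, and the number of pages to request has to be supplied by you.

```r
library(rvest)

# Build the API URL for one page; the parameter names are copied from the
# link used above (assumption: changing `page` selects the page).
build_fund_url <- function(code, page, sdate, edate, per = 40) {
  sprintf(
    "https://fundf10.eastmoney.com/F10DataApi.aspx?type=lsjz&code=%s&page=%d&sdate=%s&edate=%s&per=%d",
    code, as.integer(page), sdate, edate, as.integer(per)
  )
}

# Fetch the requested pages and stack them into one data frame.
# `fetch` defaults to read_html; it is an argument only so the function
# can be tried offline with a stand-in.
scrape_fund_pages <- function(code, pages, sdate, edate, per = 40,
                              fetch = read_html) {
  tables <- lapply(pages, function(p) {
    url <- build_fund_url(code, p, sdate, edate, per)
    fetch(url) %>% html_nodes("table") %>% html_table() %>% .[[1]]
  })
  do.call(rbind, tables)
}

# Example call (needs network access; 1:3 is an illustrative page range):
# all_rows <- scrape_fund_pages("510300", 1:3, "2019-01-01", "2021-02-13")
```

If a column comes back with different types on different pages, rbind may coerce it or fail, so it is worth checking the combined result before relying on it.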