For each row of the table on the page, I want to click the ID (e.g. the ID of row 1 is 270516746) and extract/download the information behind it (the rows do not all have the same headers) into some kind of Python object, ideally a JSON object or a dataframe (JSON is probably easier).
I have gotten to the point where I can reach the table I want to pull down:
import os
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
import sys
driver = webdriver.Chrome()
driver.get('http://mahmi.org/explore.php?filterType=&filter=&page=1')
#find the table with ID, Sequence, Bioactivity and Similarity
element = driver.find_elements_by_css_selector('table.table-striped tr')
for row in element[1:2]: #change this, only for testing
    id, seq, bioact, sim = row.text.split()
    #now i've made a list of each row's id, sequence, bioactivity and similarity.
    #click on each ID to get the full data of each
    print(id)
    button = driver.find_element_by_xpath('//button[text()="270516746"]') #this is one example, hard-coded
    button.click()
    #then pull down all the info to a json file?
    full_table = driver.find_element_by_xpath('.//*[@id="source-proteins"]')
    print(full_table)
Then I got stuck on what is probably the last step: once the button in the row above has been clicked, I cannot find how to do something like '.to_json()' or '.to_dataframe()' on the result.
If anyone can advise, I would be very grateful.
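To show the shape of output I am after, here is a minimal standalone sketch (the row string and field values are made-up stand-ins for `row.text`, since a WebElement has no `.to_json()` of its own):

```python
import json

# Hypothetical sketch: 'row_text' is a made-up stand-in for row.text
# from the loop above. The fields are split out and collected into a
# plain dict, which json can serialize directly.
row_text = "270516746 MKVLAA 0.85 0.99"
peptide_id, seq, bioact, sim = row_text.split()
record = {"id": peptide_id, "sequence": seq,
          "bioactivity": bioact, "similarity": sim}
print(json.dumps(record))
```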
Update 1: deleted and merged into the above.
Update 2: following the suggestion below to use BeautifulSoup, my question is how to navigate to the popup's 'modal-body' class and then use BeautifulSoup there:
#then pull down all the info to a json file?
full_table = driver.find_element_by_class_name("modal-body")
soup = BeautifulSoup(full_table,'html.parser')
print(soup)
This returns the error:
soup = BeautifulSoup(full_table,'html.parser')
File "/Users/kela/anaconda/envs/selenium_scripts/lib/python3.6/site-packages/bs4/__init__.py", line 287, in __init__
elif len(markup) <= 256 and (
TypeError: object of type 'WebElement' has no len()
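I suspect this error is because BeautifulSoup expects a markup string rather than a WebElement. A quick standalone check parses fine when given a string (here a literal stands in for what `full_table.get_attribute('innerHTML')` would return):

```python
from bs4 import BeautifulSoup

# BeautifulSoup needs a markup string, not a WebElement; this literal
# is a made-up stand-in for full_table.get_attribute('innerHTML').
inner_html = '<p>Id: <span id="info-ref-id">270516746</span></p>'
soup = BeautifulSoup(inner_html, 'html.parser')
print(soup.find('span', id='info-ref-id').text)
```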
Update 3: I then tried to scrape the page with BeautifulSoup alone:
from bs4 import BeautifulSoup
import requests
url = 'http://mahmi.org/explore.php?filterType=&filter=&page=1'
html_doc = requests.get(url).content
soup = BeautifulSoup(html_doc, 'html.parser')
container = soup.find("div", {"class": "modal-body"})
print(container)
It prints:
<div class="modal-body">
<h4><b>Reference information</b></h4>
<p>Id: <span id="info-ref-id">XXX</span></p>
<p>Bioactivity: <span id="info-ref-bio">XXX</span></p>
<p><a id="info-ref-seq">Download sequence</a></p><br/>
<h4><b>Source proteins</b></h4>
<div id="source-proteins"></div>
</div>
But this is not the output I want, because it does not print the nested JSON layer (e.g. there is more information below the source-proteins div).
Update 4: when I add this to my original code above (from before the updates):
full_table = driver.find_element_by_class_name("modal-body")
with open('test_outputfile.json', 'w') as output:
    json.dump(full_table, output)
the output is "TypeError: Object of type 'WebElement' is not JSON serializable", which I am now trying to work out.
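A standalone check suggests json.dump only handles built-in Python types, so the WebElement would first need to be reduced to e.g. its `.text`. Here a made-up literal stands in for that extracted text:

```python
import json

# json.dump handles only built-in Python types; 'modal_text' is a
# made-up stand-in for full_table.text extracted before serializing.
modal_text = "Id: 270516746\nBioactivity: antimicrobial"
with open('test_outputfile.json', 'w') as output:
    json.dump({"modal_text": modal_text}, output)
with open('test_outputfile.json') as check:
    print(json.load(check)["modal_text"])
```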
Update 5: trying to copy the approach from this https://stackoverflow.com/questions/30945212/how-to-parse-selenium-driver-elements question, I added:
full_div = driver.find_element_by_css_selector('div.modal-body')
for element in full_div:
    new_element = element.find_element_by_css_selector('<li>Investigation type: metagenome</li>')
    print(new_element.text)
(I added the li element just to see if it would work), but I get the error:
Traceback (most recent call last):
  File "scrape_mahmi.py", line 28, in <module>
    for element in full_div:
TypeError: 'WebElement' object is not iterable
Update 6: I tried looping through the ul/li elements, because I could see that the text I want is li text embedded in a ul in a li in a ul in a div; so I tried:
html_list = driver.find_elements_by_tag_name('ul')
for each_ul in html_list:
    items = each_ul.find_elements_by_tag_name('li')
    for item in items:
        next_ul = item.find_elements_by_tag_name('ul')
        for each_ul in next_ul:
            next_li = each_ul.find_elements_by_tag_name('li')
            for each_li in next_li:
                print(each_li.text)
This runs without error, but I get no output.
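To check whether the nested-loop logic itself is at fault, I ran an analogous traversal with BeautifulSoup over a static snippet (a made-up structure mimicking the modal's nested ul/li layout). It does find the inner li text, so I suspect the real page's modal content simply is not in the DOM until the button click / JavaScript has populated it:

```python
from bs4 import BeautifulSoup

# Made-up static snippet mimicking the modal's nested ul/li layout,
# used to check the traversal logic without the live page.
html = ('<div><ul><li>protein'
        '<ul><li>Investigation type: metagenome</li>'
        '<li>Project name: example project</li></ul>'
        '</li></ul></div>')
soup = BeautifulSoup(html, 'html.parser')
found = []
for each_ul in soup.find_all('ul'):
    # recursive=False keeps each li paired with its own parent ul,
    # so nested items are not visited twice.
    for item in each_ul.find_all('li', recursive=False):
        for next_ul in item.find_all('ul'):
            for each_li in next_ul.find_all('li'):
                found.append(each_li.text)
print(found)
```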