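I want to import Macrotrends data into a pandas DataFrame. Judging from the page source, the data appears to live in a jqxGrid.
I tried pandas/Beautiful Soup with the read_html function, but no table was found. I am now trying to extract the data with Selenium. I hoped that moving the horizontal scrollbar would make the jqxgrid table load so that it could be extracted. However, that did not work.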
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
import time
driver = webdriver.Chrome()
driver.maximize_window()
driver.execute_script("window.location = 'http://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q';")
driver.implicitly_wait(2)
grid = driver.find_element(By.ID, 'jqxgrid')
time.sleep(1)
driver.execute_script("window.scrollBy(0, 600);")
scrollbar = driver.find_element(By.ID, 'jqxScrollThumbhorizontalScrollBarjqxgrid')
time.sleep(1)
actions = ActionChains(driver)
time.sleep(1)
for i in range(1, 6):
    actions.drag_and_drop_by_offset(scrollbar, i*70, 0).perform()
    time.sleep(1)
pd.read_html(grid.get_attribute('outerHTML'))
The error I get is:
ValueError: No tables found
I expect the table data from "http://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q" to be imported into a DataFrame.
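pd.read_html only parses <table> elements, so checking whether the grid's outerHTML contains any is a quick way to explain the error. Below is a minimal sketch using only the standard library. The grid_html string is an invented stand-in for grid.get_attribute('outerHTML'), on the assumption that the grid is rendered with div elements rather than a table; count_tables and TableCounter are hypothetical helper names.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_tables(html_text):
    counter = TableCounter()
    counter.feed(html_text)
    return counter.tables


# Invented markup: a div-based grid (assumption) versus a real table.
grid_html = '<div id="jqxgrid"><div role="row"><div role="gridcell">Revenue</div></div></div>'
table_html = "<table><tr><td>Revenue</td></tr></table>"

print(count_tables(grid_html))
print(count_tables(table_html))
```

If the count for the real outerHTML is zero, scrolling the grid will not help, because read_html has nothing to parse however much of the grid is rendered.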
Here is a faster alternative to Selenium, with the headers as they appear on the page.
import requests
from bs4 import BeautifulSoup as bs
import re
import json
import pandas as pd
r = requests.get('https://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q')
p = re.compile(r' var originalData = (.*?);\r\n\r\n\r',re.DOTALL)
data = json.loads(p.findall(r.text)[0])
headers = list(data[0].keys())
headers.remove('popup_icon')
result = []
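for row in data:
    # field_name holds an HTML snippet; parse it to get the visible label
    soup = bs(row['field_name'], 'html.parser')
    field_name = soup.select_one('a, span').text
    # Skip the first two values (field_name and popup_icon), keep the period columns
    fields = list(row.values())[2:]
    fields.insert(0, field_name)
    result.append(fields)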
df = pd.DataFrame(result, columns=headers)
# option_context only takes effect inside a with-block
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df.head())
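The approach above works because the grid's data is embedded in the page as a JavaScript variable, so a regular expression can pull it out and json.loads can parse it. Here is a minimal offline sketch of that step. The sample_page string, its field values, the extract_rows helper and the looser regex are all assumptions made for illustration, not the site's actual markup.

```python
import json
import re
from html import unescape

# Made-up rows standing in for the site's originalData (values are invented).
sample_rows = [
    {"field_name": "<a href='/x'>Revenue</a>", "popup_icon": "<div></div>",
     "2019-03-31": "100", "2018-12-31": "200"},
    {"field_name": "<span>Net Income</span>", "popup_icon": "<div></div>",
     "2019-03-31": "10", "2018-12-31": "20"},
]
# Hypothetical stand-in for r.text; the real page is much larger.
sample_page = "<script>\n var originalData = " + json.dumps(sample_rows) + ";\n</script>"


def extract_rows(page_text):
    """Return the embedded originalData rows with plain-text labels."""
    # Assumes the array ends at the first "];" after the variable name.
    match = re.search(r"var originalData = (\[.*?\]);", page_text, re.DOTALL)
    if match is None:
        raise ValueError("originalData not found in page")
    rows = []
    for row in json.loads(match.group(1)):
        # Strip the HTML tags wrapped around the row label.
        label = unescape(re.sub(r"<[^>]+>", "", row["field_name"])).strip()
        # Select columns by key name instead of by position.
        values = {k: v for k, v in row.items()
                  if k not in ("field_name", "popup_icon")}
        rows.append({"field_name": label, **values})
    return rows


rows = extract_rows(sample_page)
print(rows[0])
```

The resulting list of dicts can be passed directly to pd.DataFrame(rows). Selecting keys by name avoids relying on the key order that list(row.values())[2:] assumes.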