Scraping a response table from a website whose URL does not change

2024-02-15

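I want to scrape the price history from this website: after clicking the Price History button, the table is loaded but the URL stays the same. I want to scrape the table that gets loaded.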

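import requests
from bs4 import BeautifulSoup

url = '...'  # URL of the company page (not included in the original post)
rr = requests.get(url)
htmll = rr.text
# Name the parser explicitly so BeautifulSoup does not guess one and warn
soup = BeautifulSoup(htmll, 'html.parser')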

Using DevTools (tab: Network) in Chrome/Firefox you can see that this page uses JavaScript to load the data from another URL.

https://www.sharesansar.com/company-price-history?draw=1&columns%5B0%5D%5Bdata%5D=DT_Row_Index&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=false&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=published_date&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=open&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=false&columns%5B2%5D%5Borderable%5D=false&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=high&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=false&columns%5B3%5D%5Borderable%5D=false&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=low&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=false&columns%5B4%5D%5Borderable%5D=false&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=close&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=false&columns%5B5%5D%5Borderable%5D=false&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=per_change&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=false&columns%5B6%5D%5Borderable%5D=false&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=traded_quantity&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=false&columns%5B7%5D%5Borderable%5D=false&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B8%5D%5Bdata%5D=traded_amount&columns%5B8%5D%5Bname%5D=&columns%5B8%5D%5Bsearchable%5D=false&columns%5B8%5D%5Borderable%5D=false&columns%5B8%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B8%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=20&search%5Bvalue%5D=&search%5Bregex%5D=false&company=95&_=1639245456705

Using requests with this URL you can get the table as JSON data, so you don't need BeautifulSoup.


In the code I converted all the values from this URL into a payload dictionary, so you can easily replace values to get different data.
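One way to do that conversion without typing every key by hand is to let the standard library split the query string. The sketch below uses a shortened sample of the URL above (only a few of its parameters), and `url_to_payload` is a helper name made up for this illustration, not part of requests.

```python
from urllib.parse import urlsplit, parse_qsl


def url_to_payload(full_url):
    """Split a URL into (base_url, payload_dict).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(full_url)
    base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # parse_qsl decodes %5B/%5D to [ and ]; keep_blank_values=True keeps
    # parameters whose value is empty, such as columns[0][name]=
    payload = dict(parse_qsl(parts.query, keep_blank_values=True))
    return base_url, payload


# Shortened sample of the long URL (assumption: a few parameters
# are enough to show the idea).
sample = (
    "https://www.sharesansar.com/company-price-history"
    "?draw=1&columns%5B0%5D%5Bdata%5D=DT_Row_Index"
    "&columns%5B0%5D%5Bname%5D=&start=0&length=20&company=95"
)

base_url, payload = url_to_payload(sample)
print(base_url)
for key, value in payload.items():
    print(repr(key), repr(value))
```

The resulting dictionary can be passed as `params=` to `requests.get()`, which encodes the brackets again when it builds the request.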

Maybe if you change 'start' (20, 40, etc.) you can get the next pages of the table.

If you use 'length': 50 then you can get more values in one request, but larger values don't work.
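The paging idea could be sketched as follows. It rests on the same guesses as the two remarks above: that the server treats 'start' as a row offset and accepts 'length' values up to 50. Neither is verified here. `page_payloads` is a hypothetical helper, and `base_payload` stands in for the full payload dictionary.

```python
def page_payloads(base_payload, pages, length=20):
    """Yield one params dict per page by stepping 'start' in units of 'length'.

    Hypothetical helper: assumes the server treats 'start' as a row offset.
    """
    for page in range(pages):
        params = dict(base_payload)           # copy, so the original stays untouched
        params["start"] = str(page * length)  # 0, 20, 40, ... for length=20
        params["length"] = str(length)
        yield params


# Stand-in for the full payload; only the keys that matter for paging.
base_payload = {"company": "95", "draw": "1", "start": "0", "length": "20"}

for params in page_payloads(base_payload, pages=3):
    print(params["start"], params["length"])
    # To actually fetch a page (needs network access):
    # response = requests.get(url, params=params, headers=headers)
    # rows = response.json()["data"]
```

Stopping when a response returns an empty 'data' list would be a natural end condition, though that behavior is also an assumption about this server.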

BTW: this URL needs the header X-Requested-With, which AJAX requests send.

import requests

payload = {
 '_': '1639245456705',
 'columns[0][data]': 'DT_Row_Index',
 'columns[0][orderable]': 'false',
 'columns[0][search][regex]': 'false',
 'columns[0][searchable]': 'false',
 'columns[1][data]': 'published_date',
 'columns[1][orderable]': 'false',
 'columns[1][search][regex]': 'false',
 'columns[1][searchable]': 'true',
 'columns[2][data]': 'open',
 'columns[2][orderable]': 'false',
 'columns[2][search][regex]': 'false',
 'columns[2][searchable]': 'false',
 'columns[3][data]': 'high',
 'columns[3][orderable]': 'false',
 'columns[3][search][regex]': 'false',
 'columns[3][searchable]': 'false',
 'columns[4][data]': 'low',
 'columns[4][orderable]': 'false',
 'columns[4][search][regex]': 'false',
 'columns[4][searchable]': 'false',
 'columns[5][data]': 'close',
 'columns[5][orderable]': 'false',
 'columns[5][search][regex]': 'false',
 'columns[5][searchable]': 'false',
 'columns[6][data]': 'per_change',
 'columns[6][orderable]': 'false',
 'columns[6][search][regex]': 'false',
 'columns[6][searchable]': 'false',
 'columns[7][data]': 'traded_quantity',
 'columns[7][orderable]': 'false',
 'columns[7][search][regex]': 'false',
 'columns[7][searchable]': 'false',
 'columns[8][data]': 'traded_amount',
 'columns[8][orderable]': 'false',
 'columns[8][search][regex]': 'false',
 'columns[8][searchable]': 'false',
 'company': '95',
 'draw': '1',
 'length': '20',
 'search[regex]': 'false',
 'start': '0'
}

headers = {
    'X-Requested-With': 'XMLHttpRequest'
}

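url = 'https://www.sharesansar.com/company-price-history'

# requests builds the query string from payload; the header marks the request as AJAX
response = requests.get(url, params=payload, headers=headers)
response.raise_for_status()  # stop on an HTTP error instead of failing inside .json()
data = response.json()
#print(data)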

print('NR  | DATE       | OPEN   | CLOSE  |')
for number, item in enumerate(data['data'], 1):
    print(f'{number:3} |', item['published_date'], "|", item['open'], "|", item['close'], "|")

Result:

NR  | DATE       | OPEN   | CLOSE  |
  1 | 2021-12-09 | 208.70 | 206.00 |
  2 | 2021-12-08 | 214.90 | 205.00 |
  3 | 2021-12-07 | 218.00 | 211.00 |
  4 | 2021-12-06 | 208.00 | 214.00 |
  5 | 2021-12-05 | 215.00 | 211.00 |
  6 | 2021-12-02 | 225.00 | 217.10 |
  7 | 2021-12-01 | 226.00 | 224.50 |
  8 | 2021-11-30 | 224.60 | 225.00 |
  9 | 2021-11-29 | 227.00 | 225.00 |
 10 | 2021-11-28 | 233.00 | 227.00 |
 11 | 2021-11-25 | 233.00 | 237.00 |
 12 | 2021-11-24 | 228.00 | 230.00 |
 13 | 2021-11-23 | 233.50 | 230.10 |
 14 | 2021-11-22 | 238.00 | 237.00 |
 15 | 2021-11-21 | 242.70 | 234.40 |
 16 | 2021-11-18 | 236.00 | 240.00 |
 17 | 2021-11-17 | 243.00 | 240.00 |
 18 | 2021-11-16 | 232.00 | 239.90 |
 19 | 2021-11-15 | 226.00 | 231.50 |
 20 | 2021-11-14 | 228.00 | 225.60 |
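
If the rows are needed later, they could be saved instead of printed. This is a minimal sketch using the standard csv module. The three sample rows are copied from the output above, their values are assumed to arrive as strings, and only the keys used in the answer (published_date, open, close) are assumed to exist in each item. `rows_to_csv` is a made-up helper name.

```python
import csv
import io

# Sample rows copied from the output above; in the real script
# this would be data['data'] from response.json().
rows = [
    {"published_date": "2021-12-09", "open": "208.70", "close": "206.00"},
    {"published_date": "2021-12-08", "open": "214.90", "close": "205.00"},
    {"published_date": "2021-12-07", "open": "218.00", "close": "211.00"},
]

fields = ["published_date", "open", "close"]


def rows_to_csv(rows, fields, out):
    """Write the selected fields of each row to the file-like object `out`."""
    # extrasaction='ignore' skips keys that are not listed in `fields`
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# StringIO keeps the example offline; for a real file use
# open('history.csv', 'w', newline='') instead.
buffer = io.StringIO()
rows_to_csv(rows, fields, buffer)
print(buffer.getvalue())
```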