Scraping and parsing a multi-page (aspx) table

2024-04-09

I am trying to scrape information about greyhound races. For example, I want to scrape http://www.gbgb.org.uk/RaceCard.aspx?dogName=Hardwick%20Serena. This page shows all results for the dog Hardwick Serena, but they are split across several pages.

Inspecting the page, the "Next Page" button shows up as:

    <input type="submit" name="ctl00$ctl00$mainContent$cmscontent$DogRaceCard$lvDogRaceCard$ctl00$ctl03$ctl01$ctl12" value=" " title="Next Page" class="rgPageNext">

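I was hoping for a plain HTML link that I could use for the next iteration of the scrape, but had no luck. Looking at the network traffic, the browser sends a very long (hashed?) string for __VIEWSTATE, among other fields. Possibly to protect the database?

I am looking for a way to scrape all pages for one dog, either by iterating over all pages or by increasing the page length so that 100+ rows are shown on page 1. The site is served as .aspx (ASP.NET) pages.

I am using Python 3.5 and BeautifulSoup.

Current code: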

    import requests
    from   bs4 import BeautifulSoup

    url = 'http://www.gbgb.org.uk/RaceCard.aspx?dogName=Hardwick%20Serena'

    with requests.session() as s:
        s.headers['user-agent'] = 'Mozilla/5.0'

        r    = s.get(url)
        soup = BeautifulSoup(r.content, 'html5lib')

        target = 'ctl00$ctl00$mainContent$cmscontent$DogRaceCard$btnFilter_input'

        data = { tag['name']: tag['value'] 
            for tag in soup.select('input[name^=ctl00]') if tag.get('value')
        }
        state = { tag['name']: tag['value'] 
            for tag in soup.select('input[name^=__]')
        }

        data.update(state)

        numberpages = int(str(soup.find('div', 'rgWrap rgInfoPart')).split(' ')[-2].split('>')[1].split('<')[0])
        # for page in range(last_page + 1):

        for page in range(numberpages):
            data['__EVENTTARGET'] = target.format(page)
            #data['__VIEWSTATE'] = target.format(page)
            print(10)
            r    = s.post(url, data=data)
            soup = BeautifulSoup(r.content, 'html5lib')

            tables = soup.findChildren('table')
            my_table = tables[9]
            rows = my_table.findChildren(['th', 'tr'])

            tabel = [[]]
            for i in range(len(rows)):
                 cells = rows[i].findChildren('td')
                 tabel.append([])
                 for j in range(len(cells)):
                     value = cells[j].string
                     tabel[i].append(value)

            table = []
            for i in range(len(tabel)):
                if len(tabel[i]) == 16:
                    del tabel[i][-2:]
                    table.append(tabel[i])

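In this case, each page after the first is requested with a POST whose form-URL-encoded body carries the parameters __EVENTTARGET and __VIEWSTATE:

  • __VIEWSTATE can easily be read from the hidden input tag with that id
  • __EVENTTARGET is different for each page, and its value is passed to a JavaScript function in each page link, so you can extract it with a regular expression: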

    <a href="javascript:__doPostBack('ctl00$ctl00$mainContent$cmscontent$DogRaceCard$lvDogRaceCard$ctl00$ctl03$ctl01$ctl07','')">
        <span>2</span>
    </a>
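
To make the extraction step concrete, here is a minimal sketch that pulls the first argument of every `__doPostBack(...)` call out of raw HTML. It uses only the standard library; the helper name `extract_event_targets` and the shortened control id in the usage note are made up for illustration. Note that it scans the whole page rather than only the pager, so on a real page you would probably want to restrict it to the pager element first.

```python
import html
import re

# Matches __doPostBack('<target>', ...) and captures <target>.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']+)'\s*,")


def extract_event_targets(page_html):
    """Return the first argument of every __doPostBack(...) call, in page order."""
    # Quotes inside href attributes may be written as &#39;, so unescape first.
    return POSTBACK_RE.findall(html.unescape(page_html))
```

For example, `extract_event_targets("<a href=\"javascript:__doPostBack('ctl00$pager$ctl07','')\">2</a>")` gives a one-element list holding `ctl00$pager$ctl07`.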
    

Python script:

    from bs4 import BeautifulSoup
    import requests
    import re

    # extract the result rows from one page and print them
    def extract_data(soup):
        tables = soup.find_all("div", {"class":"race-card"})[0].find_all("tbody")

        item_list = [
            (
                t[0].text.strip(), #date
                t[1].text.strip(), #dist
                t[2].text.strip(), #TP
                t[3].text.strip(), #StmHCP
                t[4].text.strip(), #Fin
                t[5].text.strip(), #By
                t[6].text.strip(), #WinnerOr2nd
                t[7].text.strip(), #Venue
                t[8].text.strip(), #Remarks
                t[9].text.strip(), #WinTime
                t[10].text.strip(), #Going
                t[11].text.strip(), #SP
                t[12].text.strip(), #Class
                t[13].text.strip()  #CalcTm
            )
            for t in (t.find_all('td') for t in tables[1].find_all('tr'))
            if t
        ]
        print(item_list)

    session = requests.Session()

    url = 'http://www.gbgb.org.uk/RaceCard.aspx?dogName=Hardwick%20Serena'

    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # get view state value
    view_state = soup.find_all("input", {"id":"__VIEWSTATE"})[0]["value"]

    # get all event target values (raw string avoids invalid-escape warnings)
    event_target = soup.find_all("div", {"class":"rgNumPart"})[0]
    event_target_list = [
        re.search(r"__doPostBack\('(.*)',", t["href"]).group(1)
        for t in event_target.find_all('a')
    ]

    # extract data for the 1st page
    extract_data(soup)

    # extract data for each page except the first
    for link in event_target_list[1:]:
        print("get page {0}".format(link))
        post_data = {
            '__EVENTTARGET': link,
            '__VIEWSTATE': view_state
        }
        response = session.post(url, data=post_data)
        soup = BeautifulSoup(response.content, "html.parser")
        extract_data(soup)
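
The script above reuses the __VIEWSTATE from the first page for every POST. Some ASP.NET pages also expect other hidden fields (for example __EVENTVALIDATION) and a fresh __VIEWSTATE from the most recent response; whether this particular site needs that is an assumption, not something confirmed here. If it does, a sketch along these lines could rebuild the payload from each response. `HiddenFieldParser` and `build_postback_payload` are hypothetical helper names, and the standard-library parser is used only so the snippet runs without the page; the same thing can be done with BeautifulSoup.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of hidden <input> tags whose name starts with "__"."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and name.startswith("__"):
            self.fields[name] = attr.get("value") or ""


def build_postback_payload(page_html, event_target):
    """Build POST data for the next page from the hidden fields of the current one."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    payload = dict(parser.fields)
    # Overwrite whatever the page had with the pager link we want to "click".
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = ""
    return payload
```

Inside the loop this would replace the hand-built `post_data`, e.g. `session.post(url, data=build_postback_payload(response.text, link))`, so that each request carries the state returned by the previous one.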