Python: How to retrieve Google Scholar citations per year?

2024-05-05

I am trying to retrieve information from a Google Scholar profile. I have the URL:

# Python 2 code (urllib2); on Python 3, import urlopen from urllib.request instead
from bs4 import BeautifulSoup
from urllib2 import urlopen

url = "https://scholar.google.com/citations?user=qc6CJjYAAAAJ"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

I was able to get information such as h_index, i10_index and citations in the following way:

indexes = soup.find_all("td", "gsc_rsb_std")
h_index = indexes[2].string
i10_index = indexes[4].string
citations = indexes[0].string
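
For readers on Python 3, the same lookup could be sketched roughly as below. This is only an illustration, not the asker's code: `fetch_html`, `parse_metrics`, the `User-Agent` value and the sample markup are hypothetical names and data introduced here, and the cell positions (0, 2, 4) are simply taken from the snippet above.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page as text; the User-Agent value is an arbitrary placeholder."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_metrics(html):
    """Read citations, h-index and i10-index from the 'gsc_rsb_std' cells."""
    soup = BeautifulSoup(html, "html.parser")
    cells = [td.get_text(strip=True) for td in soup.find_all("td", "gsc_rsb_std")]
    if len(cells) < 5:
        raise ValueError("expected at least 5 'gsc_rsb_std' cells, got %d" % len(cells))
    # Same positions as in the question: 0 = citations, 2 = h-index, 4 = i10-index.
    return {
        "citations": int(cells[0]),
        "h_index": int(cells[2]),
        "i10_index": int(cells[4]),
    }


# Hypothetical markup for offline experiments; the numbers are made up.
SAMPLE_METRICS_HTML = """
<table>
<tr><td class="gsc_rsb_std">1200</td><td class="gsc_rsb_std">300</td></tr>
<tr><td class="gsc_rsb_std">18</td><td class="gsc_rsb_std">9</td></tr>
<tr><td class="gsc_rsb_std">25</td><td class="gsc_rsb_std">7</td></tr>
</table>
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE_METRICS_HTML))
```

Against the live profile one would call `parse_metrics(fetch_html(url))`; whether the page still uses these class names is not guaranteed.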

Now I would like to know how to get the total number of citations per year, as shown in the chart on Google Scholar.


Use re.compile to find the three classes of table cells, including the citation year:

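# Python 2 code: urllib.urlopen became urllib.request.urlopen in Python 3
from bs4 import BeautifulSoup as soup
import urllib
import re
s = soup(str(urllib.urlopen('https://scholar.google.com/citations?user=qc6CJjYAAAAJ').read()), 'lxml')
# Collect the child texts of every title (gsc_a_t), cited-by (gsc_a_c) and year (gsc_a_y) cell
final_data = [[b.text for b in i] for i in s.find_all('td', {'class':re.compile('gsc_a_t|gsc_a_c|gsc_a_y')})]
# Every three consecutive cells belong to one publication row
grouped_data = [final_data[i:i+3] for i in range(0, len(final_data), 3)]
# Keep the first child text of each cell and label it
citations = [dict(zip(['title', 'cited by', 'year'], map(lambda x:x[0], i))) for i in grouped_data]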

Output:

[{'year': u'1935', 'cited by': u'16471', 'title': u'Can quantum-mechanical description of physical reality be considered complete?'},
 {'year': u'1905', 'cited by': u'10925', 'title': u'Uber einen die Erzeugung und Verwandlung des Lichtes betreffenden heurischen Gesichtpunkt'},
 {'year': u'1905', 'cited by': u'9425', 'title': u'On the movement of small particles suspended in stationary liquids required by the molecular-kinetic theory of heat'},
 {'year': u'1956', 'cited by': u'4648', 'title': u'Investigations on the Theory of the Brownian Movement'},
 {'year': u'', 'cited by': u'4540', 'title': u'Zur Elektrodynamik bewegter K\xf6rper'},
 {'year': u'1911', 'cited by': u'4285', 'title': u'Graviton Mass and Inertia Mass'},
 {'year': u'1918', 'cited by': u'4196', 'title': u'On gravitational waves Sitzungsber. preuss'},
 {'year': u'1925', 'cited by': u'3947', 'title': u'Sitzungsber. K'},
 {'year': u'1917', 'cited by': u'3914', 'title': u'Sitzungsberichte der Preussischen Akad. d'},
 {'year': u'1906', 'cited by': u'3633', 'title': u'Eine neue bestimmung der molek\xfcldimensionen'},
 {'year': u'1950', 'cited by': u'3538', 'title': u'The meaning of relativity'},
 {'year': u'1998', 'cited by': u'3472', 'title': u'Ueber einen die Erzeugung und Verwandlung des Lichtes betreffenden heuristischen Gesichtspunkt'},
 {'year': u'1915', 'cited by': u'3065', 'title': u'Sitzungsberichte der Preussischen Akademie der Wissenschaften zu Berlin'},
 {'year': u'1954', 'cited by': u'2969', 'title': u'Evolution of Physics'},
 {'year': u'1920', 'cited by': u'2919', 'title': u'The special and general theory'},
 {'year': u'2006', 'cited by': u'2804', 'title': u'Die grundlage der allgemeinen relativit\xe4tstheorie'},
 {'year': u'1982', 'cited by': u'2643', 'title': u'The Science and the Life of Albert Einstein'},
 {'year': u'1917', 'cited by': u'2570', 'title': u'Zur quantentheorie der strahlung'},
 {'year': u'1954', 'cited by': u'2489', 'title': u'Physics and Reality, in \u201cIdeas and Opinions\u201d'},
 {'year': u'1924', 'cited by': u'2440', 'title': u'Quantum theory of monatomic ideal gases'}]
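
Grouping the flat cell list in threes assumes every row contributes exactly one title, one cited-by and one year cell, in that order. A possible variant, sketched here for Python 3, pairs the cells through their parent row instead. `parse_publications` and the sample markup are hypothetical, and the assumption that each `gsc_a_t` cell sits directly inside the `<tr>` that also holds the other two cells is not confirmed by the answer.

```python
from bs4 import BeautifulSoup


def parse_publications(html):
    """Return one dict per publication, pairing cells by their parent row."""
    soup = BeautifulSoup(html, "html.parser")
    publications = []
    for title_cell in soup.find_all("td", class_="gsc_a_t"):
        row = title_cell.parent  # assumed to be the <tr> holding all three cells
        cited_cell = row.find("td", class_="gsc_a_c")
        year_cell = row.find("td", class_="gsc_a_y")
        publications.append({
            # First non-empty string in the cell, similar to x[0] in the answer.
            "title": next(title_cell.stripped_strings, ""),
            "cited by": cited_cell.get_text(strip=True) if cited_cell else "",
            "year": year_cell.get_text(strip=True) if year_cell else "",
        })
    return publications


# Hypothetical markup for offline experiments; titles and numbers are made up.
SAMPLE_PUBLICATIONS_HTML = """
<table>
<tr>
  <td class="gsc_a_t"><a>First paper</a><div>A Author</div></td>
  <td class="gsc_a_c"><a>42</a></td>
  <td class="gsc_a_y"><span>2001</span></td>
</tr>
<tr>
  <td class="gsc_a_t"><a>Second paper</a></td>
  <td class="gsc_a_c"><a>7</a></td>
  <td class="gsc_a_y"><span></span></td>
</tr>
</table>
"""

if __name__ == "__main__":
    for publication in parse_publications(SAMPLE_PUBLICATIONS_HTML):
        print(publication)
```

A row with a missing year then yields an empty string for that row only, rather than shifting every later group by one cell.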

Edit: to find the values in the chart, slightly change the arguments passed to find_all:

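# Python 2 code: urllib.urlopen became urllib.request.urlopen in Python 3
from bs4 import BeautifulSoup as soup
import urllib
import re
s = soup(str(urllib.urlopen('https://scholar.google.com/citations?user=qc6CJjYAAAAJ').read()), 'lxml')
# Year labels of the chart (span.gsc_g_t)
years = map(int, [i.text for i in s.find_all('span', {'class':'gsc_g_t'})])
# Citation counts of the chart bars (span.gsc_g_al)
citation_number = map(int, [i.text for i in s.find_all('span', {'class':'gsc_g_al'})])
# Pair each year with its count, in document order
final_chart_data = dict(zip(years, citation_number))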

Output:

{1979: 774, 1980: 649, 1981: 572, 1982: 722, 1983: 680, 1984: 725, 1985: 743, 1986: 664, 1987: 776, 1988: 792, 1989: 879, 1990: 924, 1991: 831, 1992: 1071, 1993: 1016, 1994: 1197, 1995: 1300, 1996: 1283, 1997: 1409, 1998: 1433, 1999: 1777, 2000: 1987, 2001: 2300, 2002: 2347, 2003: 2449, 2004: 2927, 2005: 4436, 2006: 4059, 2007: 4476, 2008: 4409, 2009: 4709, 2010: 4586, 2011: 5139, 2012: 5797, 2013: 6160, 2014: 5985, 2015: 6463, 2016: 6760, 2017: 6356, 2018: 396}
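
Because `zip` silently stops at the shorter of its inputs, the pairing above is only correct when the page contains exactly one `gsc_g_al` count per `gsc_g_t` year label. A minimal Python 3 sketch that makes this assumption explicit might look as follows; `parse_citations_per_year` and the sample markup are hypothetical, and raising an error on a length mismatch is a design choice made here, not part of the answer.

```python
from bs4 import BeautifulSoup


def parse_citations_per_year(html):
    """Map each chart year (span.gsc_g_t) to its citation count (span.gsc_g_al)."""
    soup = BeautifulSoup(html, "html.parser")
    years = [int(s.get_text(strip=True))
             for s in soup.find_all("span", class_="gsc_g_t")]
    counts = [int(s.get_text(strip=True))
              for s in soup.find_all("span", class_="gsc_g_al")]
    # zip() would silently drop or misalign entries if the lists differ in length.
    if len(years) != len(counts):
        raise ValueError(
            "found %d year labels but %d counts" % (len(years), len(counts))
        )
    return dict(zip(years, counts))


# Hypothetical markup for offline experiments; the numbers are made up.
SAMPLE_CHART_HTML = """
<div>
  <span class="gsc_g_t">2016</span>
  <span class="gsc_g_t">2017</span>
  <span class="gsc_g_t">2018</span>
  <a><span class="gsc_g_al">10</span></a>
  <a><span class="gsc_g_al">25</span></a>
  <a><span class="gsc_g_al">3</span></a>
</div>
"""

if __name__ == "__main__":
    per_year = parse_citations_per_year(SAMPLE_CHART_HTML)
    print(per_year)
    print(sum(per_year.values()))  # total over the years shown in the chart
```

If the error is raised for a real profile, the years and counts would need to be matched by some other attribute of the chart markup before building the dictionary.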