How to get plain text out of Wikipedia

2024-01-15

I'd like to write a script that fetches only the description section of a Wikipedia page. That is, when I say

/wiki bla bla bla

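it will go to the Wikipedia page for bla bla bla, http://en.wikipedia.org/wiki/Bla_Bla_Bla, get the following, and return it to the chatroom:

"Bla Bla Bla" is the name of a song made by Gigi D'Agostino. He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"

How can I do this?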

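Here are a few different possible approaches; use whichever works for you. All my code examples below use requests (http://docs.python-requests.org/en/master/) for HTTP requests to the API; you can install requests with pip install requests if you have pip. They also all use the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page), and two use the query endpoint (https://www.mediawiki.org/wiki/API:Query); follow those links if you want documentation.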

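1. Get a plain text representation of either the entire page or a page "extract" straight from the API with the `extracts` prop

Note that this approach only works on MediaWiki sites that have the TextExtracts extension (https://www.mediawiki.org/wiki/Extension:TextExtracts). This notably includes Wikipedia, but not some smaller MediaWiki sites such as http://www.wikia.com/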
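
Whether a given wiki has the extension installed can be checked through the API itself. The sketch below assumes the response shape of a meta=siteinfo query with siprop=extensions (a list of extension records under query and then extensions, each with a name). has_text_extracts is a hypothetical helper name, and the sample responses are hand-written for illustration, not real API output.

```python
def has_text_extracts(siteinfo_json):
    """Return True if a siteinfo response lists the TextExtracts extension.

    Assumes the JSON came from a request with
    action=query, meta=siteinfo, siprop=extensions, format=json.
    """
    extensions = siteinfo_json.get('query', {}).get('extensions', [])
    return any(ext.get('name') == 'TextExtracts' for ext in extensions)


# Hand-written sample responses, for illustration only.
with_extension = {'query': {'extensions': [{'type': 'other', 'name': 'TextExtracts'}]}}
without_extension = {'query': {'extensions': [{'type': 'parserhook', 'name': 'Cite'}]}}

print(has_text_extracts(with_extension))
print(has_text_extracts(without_extension))
```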

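You want to send a request to the API endpoint (https://en.wikipedia.org/w/api.php in the code below) with a handful of query parameters.

Breaking the request down, we've got the following parameters in there (documented at https://www.mediawiki.org/wiki/Extension:TextExtracts#query+extracts):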

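action=query, format=json, and titles=Bla_Bla_Bla are all standard MediaWiki API parameters
prop=extracts makes us use the TextExtracts extension
exintro limits the response to content before the first section heading
explaintext makes the extract in the response plain text instead of HTML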
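
To see how these parameters fit together in a single URL, here is a small sketch that builds the query string with the standard library. It assumes, following the MediaWiki API convention for boolean parameters, that a flag such as exintro counts as set whenever it is present, so the value 1 is an arbitrary choice. build_extract_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

API_URL = 'https://en.wikipedia.org/w/api.php'


def build_extract_url(title):
    """Build a URL asking the API for a plain-text intro extract of `title`."""
    params = {
        'action': 'query',
        'format': 'json',
        'titles': title,
        'prop': 'extracts',
        'exintro': 1,       # boolean flag: presence is what matters
        'explaintext': 1,   # boolean flag: presence is what matters
    }
    # urlencode percent-encodes values and turns spaces into '+'.
    return API_URL + '?' + urlencode(params)


print(build_extract_url('Bla Bla Bla'))
```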

Then parse the JSON response and pull out the extract:

>>> import requests
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'extracts',
...         'exintro': True,
...         'explaintext': True,
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> print(page['extract'])
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

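The lookup above assumes the page exists. As a hedged sketch of a more defensive version: this assumes that, in the default JSON format, a title with no page comes back as an entry carrying a missing key and no extract, so the hypothetical helper below returns None in that case. Both sample responses are hand-written for illustration.

```python
def extract_from_response(response_json):
    """Return the extract of the first page in a query response, or None.

    Assumption: a nonexistent page has a 'missing' key and no 'extract'.
    """
    pages = response_json.get('query', {}).get('pages', {})
    for page in pages.values():
        if 'missing' in page:
            return None
        return page.get('extract')
    return None  # the response contained no pages at all


# Hand-written sample responses, for illustration only.
found = {'query': {'pages': {'123': {'pageid': 123, 'title': 'Bla Bla Bla',
                                     'extract': '"Bla Bla Bla" is a song.'}}}}
missing = {'query': {'pages': {'-1': {'title': 'No such page', 'missing': ''}}}}

print(extract_from_response(found))
print(extract_from_response(missing))
```
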
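2. Get the full HTML of the page using the `parse` endpoint, parse it, and extract the first paragraph

MediaWiki has a parse endpoint (https://www.mediawiki.org/wiki/API:Parsing_wikitext#parse) that returns the HTML of a page; you request it with action=parse and the page title, as in the example below. You can then parse that HTML with an HTML parser such as lxml (http://lxml.de/; install it first with pip install lxml) to extract the first paragraph.

For example: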

>>> import requests
>>> from lxml import html
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'parse',
...         'page': 'Bla Bla Bla',
...         'format': 'json',
...     }
... ).json()
>>> raw_html = response['parse']['text']['*']
>>> document = html.document_fromstring(raw_html)
>>> first_p = document.xpath('//p')[0]
>>> intro_text = first_p.text_content()
>>> print(intro_text)
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

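Taking the very first <p> element can be fragile: on some pages it may be empty or contain only whitespace. As a sketch of one way around that, the hypothetical helper below returns the first paragraph that has actual text. It uses the standard library's html.parser rather than lxml, purely so the example runs without extra installs, and the sample HTML is hand-written.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first <p> that contains non-whitespace text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_p = False   # True while inside a candidate <p>
        self._chunks = []    # text pieces of the paragraph being read
        self.result = None   # set once a non-empty paragraph is found

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and self.result is None:
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == 'p' and self._in_p:
            self._in_p = False
            text = ''.join(self._chunks).strip()
            if text:
                self.result = text

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def first_nonempty_paragraph(raw_html):
    parser = FirstParagraph()
    parser.feed(raw_html)
    parser.close()
    return parser.result


# Hand-written sample: an empty paragraph followed by a real one.
sample_html = ('<div><p class="mw-empty-elt">\n</p>'
               '<p><b>"Bla Bla Bla"</b> is a <a href="#">song</a>.</p>'
               '<p>Second.</p></div>')
print(first_nonempty_paragraph(sample_html))
```
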
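3. Parse wikitext yourself

You can use the query API to get the page's wikitext, parse it using mwparserfromhell (install it first using pip install mwparserfromhell), then reduce it down to human-readable text using strip_code (http://mwparserfromhell.readthedocs.io/en/latest/api/mwparserfromhell.html#mwparserfromhell.wikicode.Wikicode.strip_code). strip_code doesn't work perfectly at the time of writing (as the example below shows), but it will hopefully improve.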

>>> import requests
>>> import mwparserfromhell
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'revisions',
...         'rvprop': 'content',
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> wikicode = page['revisions'][0]['*']
>>> parsed_wikicode = mwparserfromhell.parse(wikicode)
>>> print(parsed_wikicode.strip_code())
{{dablink|For Ke$ha's song, see Blah Blah Blah (song). For other uses, see Blah (disambiguation)}}

"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

Background and writing
He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"''.

Music video
The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.

Chart performance
Chart (1999-00)PeakpositionIreland (IRMA)Search for Irish peaks23

References

External links


Category:1999 singles
Category:Gigi D'Agostino songs
Category:1999 songs
Category:ZYX Music singles
Category:Songs written by Gigi D'Agostino
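
The output above covers the whole article. To keep only the introduction with this approach, one option is to cut the wikitext at the first section heading before handing it to mwparserfromhell. The sketch below assumes headings are lines of the form == Title == and uses a plain regular expression. lead_section is a hypothetical helper, and the sample wikitext is made up.

```python
import re

# A line consisting of a wikitext heading such as "== Music video ==".
HEADING = re.compile(r'^={2,}[^=].*?={2,}\s*$', re.MULTILINE)


def lead_section(wikitext):
    """Return the wikitext that precedes the first section heading."""
    match = HEADING.search(wikitext)
    return wikitext if match is None else wikitext[:match.start()]


# Made-up sample wikitext, for illustration only.
sample = ("'''Bla Bla Bla''' is a song.\n\n"
          "== Background and writing ==\nHe described it.\n")
print(lead_section(sample).strip())
```

The result of lead_section could then be passed to mwparserfromhell.parse in place of the full wikitext.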
