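You seem to be trying to retrieve the Wikipedia categories relevant to a given paragraph.
Minor suggestions
First, I suggest performing a single request by collecting the DBpedia Spotlight results into VALUES (https://www.w3.org/TR/sparql11-query/#inline-data), for example like this: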
values = '(<{0}>)'.format('>) (<'.join(all_urls))
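As a minimal sketch of what that line produces, assume all_urls is a list of resource URIs returned by Spotlight. The two URIs below are illustrative placeholders, and build_values is a hypothetical helper name, not something from the question:

```python
# Hypothetical input: resource URIs as DBpedia Spotlight might return them.
all_urls = [
    "http://dbpedia.org/resource/Germany",
    "http://dbpedia.org/resource/Cold_War",
]


def build_values(urls):
    """Format URIs as rows of a single-variable VALUES block."""
    if not urls:
        # An empty list would otherwise yield the bogus row "(<>)".
        raise ValueError("need at least one URI")
    return '(<{0}>)'.format('>) (<'.join(urls))


values = build_values(all_urls)
print(values)
# (<http://dbpedia.org/resource/Germany>) (<http://dbpedia.org/resource/Cold_War>)
```

Each URI becomes one parenthesised row, so the string can be dropped straight into VALUES (?s) { ... }.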
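Second, if you are talking about a topic hierarchy, you should use SPARQL 1.1 property paths (https://www.w3.org/TR/sparql11-query/#propertypaths).
These two suggestions are somewhat at odds with each other: a query that contains both multiple starting points (i.e. VALUES) and arbitrary-length paths (i.e. the * and + operators) tends to be very inefficient.
Below I use the dct:subject/skos:broader property path, i.e. I retrieve the "next-level" categories.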
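To make the difference between a fixed-length and an arbitrary-length path concrete, here is a rough sketch of a hypothetical helper that assembles the query text. The function name and the depth parameter are my own illustration; the dct: and skos: prefixes are assumed to be predefined on the endpoint, as the queries in this answer already rely on.

```python
def category_query(values, depth=1):
    """Build a query that walks `depth` skos:broader steps above dct:subject.

    depth=None switches to skos:broader* (arbitrary length), which is the
    variant that may be slow when combined with many VALUES rows.
    """
    if depth is None:
        path = "dct:subject/skos:broader*"
    else:
        path = "/".join(["dct:subject"] + ["skos:broader"] * depth)
    return (
        "SELECT DISTINCT ?resource WHERE {\n"
        "  VALUES (?s) { " + values + " }\n"
        "  ?s " + path + " ?resource .\n"
        "}"
    )


# Placeholder URI, just to show the generated text.
print(category_query("(<http://example.org/a>)", depth=1))
```

With depth=1 this yields the dct:subject/skos:broader path used in the rest of this answer; depth=0 gives the direct categories only.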
Approach 1
The first approach is to order the resources by their general popularity, e.g. their PageRank (http://people.aifb.kit.edu/ath/#DBpedia_PageRank):
values = '(<{0}>)'.format('>) (<'.join(all_urls))
sparql.setQuery(
"""PREFIX vrank:<http://purl.org/voc/vrank#>
SELECT DISTINCT ?resource ?rank
FROM <http://dbpedia.org>
FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
WHERE {
VALUES (?s) {""" + values +
""" }
?s dct:subject/skos:broader ?resource .
?resource vrank:hasRank/vrank:rankValue ?rank.
} ORDER BY DESC(?rank)
LIMIT 10
""")
The results are:
dbc:Member_states_of_the_United_Nations
dbc:Country_subdivisions_of_Europe
dbc:Republics
dbc:Demography
dbc:Population
dbc:Countries_in_Europe
dbc:Third-level_administrative_country_subdivisions
dbc:International_law
dbc:Former_countries_in_Europe
dbc:History_of_the_Soviet_Union_and_Soviet_Russia
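The snippet above only sets the query. As a sketch of how such a list could be read out, assume the response has been obtained in the standard SPARQL JSON results format (with SPARQLWrapper that would typically be via setReturnFormat(JSON) and query().convert()). The helper name and the sample document are made up for illustration; the rank values are not real data.

```python
def top_resources(result_json):
    """Return (resource URI, rank) pairs from a SPARQL JSON result document."""
    rows = []
    for binding in result_json["results"]["bindings"]:
        uri = binding["resource"]["value"]
        rank = float(binding["rank"]["value"])  # values arrive as strings
        rows.append((uri, rank))
    return rows


# Made-up response in the SPARQL JSON results layout; not real ranks.
sample = {
    "head": {"vars": ["resource", "rank"]},
    "results": {"bindings": [
        {"resource": {"type": "uri", "value": "http://example.org/Category:A"},
         "rank": {"type": "literal", "value": "12.5"}},
        {"resource": {"type": "uri", "value": "http://example.org/Category:B"},
         "rank": {"type": "literal", "value": "3.0"}},
    ]},
}

for uri, rank in top_resources(sample):
    print(uri, rank)
```

The order of the rows is whatever the endpoint returned, so the ORDER BY DESC(?rank) in the query is what makes the first rows the most popular ones.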
Approach 2
The second approach is to compute category frequencies for the given text:
values = '(<{0}>)'.format('>) (<'.join(all_urls))
sparql.setQuery(
"""SELECT ?resource (count(?resource) AS ?count) WHERE {
VALUES (?s) {""" + values +
""" }
?s dct:subject ?resource
} GROUP BY ?resource
# https://github.com/openlink/virtuoso-opensource/issues/254
HAVING (count(?resource) > 1)
ORDER BY DESC(count(?resource))
LIMIT 10
""")
The results are:
dbc:Wars_by_country
dbc:Wars_involving_the_states_and_peoples_of_Europe
dbc:Wars_involving_the_states_and_peoples_of_Asia
dbc:Wars_involving_the_states_and_peoples_of_North_America
dbc:20th_century_in_Germany
dbc:Modern_history_of_Germany
dbc:Wars_involving_the_Balkans
dbc:Decades_in_Germany
dbc:Modern_Europe
dbc:Wars_involving_the_states_and_peoples_of_South_America
With dct:subject instead of dct:subject/skos:broader, the results are better:
dbc:Former_polities_of_the_Cold_War
dbc:Former_republics
dbc:States_and_territories_established_in_1949
dbc:20th_century_in_Germany_by_period
dbc:1930s_in_Germany
dbc:Modern_history_of_Germany
dbc:1990_disestablishments_in_West_Germany
dbc:1933_disestablishments_in_Germany
dbc:1949_establishments_in_West_Germany
dbc:1949_establishments_in_Germany
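For comparison, the GROUP BY / HAVING / ORDER BY logic of the second approach can be mirrored locally once the categories of each resource are known. This is only a sketch: the function name, its parameters and the toy mapping below are hypothetical, and it counts each category at most once per resource.

```python
from collections import Counter


def frequent_categories(subjects_by_resource, min_count=2, limit=10):
    """Count how many resources carry each category, most frequent first.

    min_count=2 mirrors HAVING (count(?resource) > 1); limit mirrors LIMIT.
    """
    counts = Counter(
        category
        for categories in subjects_by_resource.values()
        for category in set(categories)  # one vote per resource
    )
    frequent = [(c, n) for c, n in counts.most_common() if n >= min_count]
    return frequent[:limit]


# Toy mapping from resource to its categories; placeholder names only.
toy = {
    "ex:r1": ["cat:X", "cat:Y"],
    "ex:r2": ["cat:X", "cat:Y", "cat:Z"],
    "ex:r3": ["cat:X"],
}
print(frequent_categories(toy))
# [('cat:X', 3), ('cat:Y', 2)]
```

Doing the counting on the endpoint, as in the query above, avoids transferring all category rows; the local version is mainly useful for experimenting with other thresholds.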
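Conclusion
The results are not very good. I see two reasons: DBpedia categories are quite arbitrary, and the tools are quite primitive. Perhaps better results could be achieved by combining approaches 1 and 2. In any case, experiments with a large corpus are needed.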
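One possible way to combine the two approaches, which I have not evaluated, is to weight each category's frequency in the text by its PageRank. The sketch below assumes both have already been collected into dictionaries; the function name, the multiplicative score and the toy numbers are all assumptions made for illustration.

```python
def combine(counts, ranks):
    """Order categories by count * rank; categories without a rank score 0."""
    scored = {c: n * ranks.get(c, 0.0) for c, n in counts.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Toy inputs with placeholder names; not real counts or ranks.
counts = {"cat:A": 3, "cat:B": 2, "cat:C": 5}
ranks = {"cat:A": 1.0, "cat:B": 4.0}

print(combine(counts, ranks))
# [('cat:B', 8.0), ('cat:A', 3.0), ('cat:C', 0.0)]
```

Whether a product, a sum or some other weighting works best is exactly the kind of question that would need a larger corpus to answer.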