How to group Wikipedia categories in Python?

2024-01-04

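For each concept in my dataset I have stored the corresponding Wikipedia categories. For example, consider the following 5 concepts and their corresponding Wikipedia categories.

hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']

As you can see, the first three concepts belong to the medical domain (whereas the remaining two terms are not medical terms).

More precisely, I want to divide my concepts into medical and non-medical. However, it is very difficult to divide the concepts using the categories alone. For example, even though the two concepts enzyme inhibitor and bypass surgery are in the medical domain, their categories are very different from each other.

Therefore, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to the medical parent category).

I am currently using pymediawiki and pywikibot. However, I am not restricted to those two libraries and am happy to have solutions using other libraries as well.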

EDIT

As suggested by @IlmariKaronen, I am also using the categories of categories, and the results I got are as follows (the small font near each category shows the categories of that category).

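However, I still cannot find a way to use these category details to decide whether a given term is medical or non-medical.

Moreover, as pointed out by @IlmariKaronen, using WikiProject details could be promising. However, it seems that the Medicine WikiProject does not cover all the medical terms. Therefore we also need to check other WikiProjects.

EDIT: My current code for extracting categories from Wikipedia concepts is as follows. This can be done using either pywikibot or pymediawiki.

Using the library pymediawiki

from mediawiki import MediaWiki

# Create a client for the default (English) Wikipedia API endpoint
wikipedia = MediaWiki()
p = wikipedia.page('enzyme inhibitor')
print(p.categories)

Using the library pywikibot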

import pywikibot as pw

site = pw.Site('en', 'wikipedia')

print([
    cat.title()
    for cat in pw.Page(site, 'support-vector machine').categories()
    if 'hidden' not in cat.categoryinfo
])

The categories of categories can also be obtained in the same way, as shown in @IlmariKaronen's answer.
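To make the "parent category" idea concrete, here is a minimal sketch of walking up the category graph breadth-first with a depth limit. The function names, the `max_depth` value and the toy parent links are all assumptions made up for illustration; real links would come from the Wikipedia API.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def ancestor_categories(
    start: Iterable[str],
    get_parents: Callable[[str], Iterable[str]],
    max_depth: int = 3,
) -> Set[str]:
    """Collect every category reachable from `start` within `max_depth` parent hops."""
    seen: Set[str] = set()
    queue = deque((category, 0) for category in start)
    while queue:
        category, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for parent in get_parents(category):
            if parent not in seen:  # the category graph can contain cycles
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen


# Hypothetical, hand-written parent links standing in for real API results.
TOY_PARENTS: Dict[str, List[str]] = {
    "Category:Enzyme inhibitors": ["Category:Pharmacology"],
    "Category:Pharmacology": ["Category:Medicine"],
    "Category:Surgical procedures and techniques": ["Category:Surgery"],
    "Category:Surgery": ["Category:Medicine"],
    "Category:Climate": ["Category:Climatology"],
    "Category:Climatology": ["Category:Earth sciences"],
}


def toy_get_parents(category: str) -> List[str]:
    return TOY_PARENTS.get(category, [])


def looks_medical(categories: Iterable[str]) -> bool:
    return "Category:Medicine" in ancestor_categories(categories, toy_get_parents)


print(looks_medical(["Category:Enzyme inhibitors"]))  # True
print(looks_medical(["Category:Climate"]))            # False
```

With pywikibot, `get_parents` could presumably be a small function returning `[c.title() for c in pw.Category(site, title).categories()]`, but that wiring is untested here. The depth limit matters because almost every category eventually reaches very broad ancestors.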

If you are looking for a longer list of concepts for testing, I have included more examples below.

['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']

For a very long list, please check the link below. https://docs.google.com/document/d/1BYllMyDlw-Rb4uMh89VjLml2Bl9Y7oUlopM-Z4F6pN0/edit?usp=sharing

NOTE: I am not expecting the solution to work 100% (if the proposed algorithm is able to detect many of the medical concepts, that is enough for me).

I am happy to provide more details if needed.

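Solution overview

Okay, I would approach the problem from multiple directions. There are some great suggestions here, and if I were you I would use an ensemble of those approaches (majority voting: predict the label agreed upon by more than 50% of the classifiers in your binary case).

I am thinking about the following approaches:

Active learning (example approach provided by me below)
MediaWiki backlinks https://stackoverflow.com/a/54757134/10886420 provided as an answer by @TavoGC https://stackoverflow.com/users/10317656/tavoglc
SPARQL ancestral categories, provided as a comment to your question by @Stanislav Kralin https://stackoverflow.com/users/7879193/stanislav-kralin, and/or parent categories https://stackoverflow.com/a/54781366/10886420 provided by @Meena Nagarajan https://stackoverflow.com/users/10554298/meena-nagarajan (those two could form an ensemble on their own based on their differences, but for that you would have to contact both authors and compare their results).

This way, 2 out of 3 approaches would have to agree that a certain concept is a medical one, which further reduces the chance of an error.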
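The voting step itself is tiny. Here is a minimal sketch, assuming each of the three approaches has already produced a True/False label per concept; the function name and the example votes are made up for illustration.

```python
from typing import Dict, List, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True when strictly more than half of the votes are True."""
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(votes) * 2 > len(votes)


# Hypothetical outputs of the three approaches for two concepts.
votes_by_concept: Dict[str, List[bool]] = {
    "enzyme inhibitor": [True, True, False],
    "coffee table": [False, True, False],
}

labels = {concept: majority_vote(votes) for concept, votes in votes_by_concept.items()}
print(labels)  # {'enzyme inhibitor': True, 'coffee table': False}
```

Using an odd number of voters avoids ties in the binary case.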

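As a side note, I would argue against the approach presented by @anand_v.singh https://stackoverflow.com/users/10953776/anand-v-singh in this answer https://stackoverflow.com/a/54721431/10886420, because:

The distance metric should not be Euclidean. Cosine similarity is a much better metric (used, e.g., by spaCy https://spacy.io/) because it does not take the magnitude of the vectors into account (and it should not, since that is how word2vec or GloVe were trained).
Many artificial clusters would be created, if I understood correctly, while we only need two: medicine and non-medicine. Furthermore, the centroid of the medicine cluster is not centered on medicine itself. This poses additional problems: say the centroid moves far away from medicine, then other words like computer or human (or anything else that in your opinion does not fit into medicine) might get into the cluster.
It is hard to evaluate the results; even more so, the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/TSNE/similar for so many words would give us totally nonsensical results [yes, I have tried it: PCA gets around 5% explained variance for your longer dataset, which is really, really low]).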
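To illustrate the first point, here is a small sketch in plain Python (no particular library assumed, the vectors are made up) showing that cosine similarity ignores vector magnitude while Euclidean distance does not.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [1.0, 2.0, 3.0]
scaled = [10 * x for x in v]  # same direction, ten times the magnitude

print(cosine_similarity(v, scaled))   # close to 1.0: direction is identical
print(euclidean_distance(v, scaled))  # large: driven purely by magnitude
```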

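Based on the problems highlighted above, I have come up with a solution using active learning https://en.wikipedia.org/wiki/Active_learning_(machine_learning), which is a rather forgotten approach to such problems.

Active learning approach

In this subset of machine learning, when we have a hard time coming up with an exact algorithm (like what it means for a term to be part of the medical category), we ask a human "expert" (who does not actually have to be an expert) to provide some answers.

Knowledge encoding

As anand_v.singh https://stackoverflow.com/users/10953776/anand-v-singh pointed out, word vectors are one of the most promising approaches, and I will use them here as well (differently though, and in my opinion in a much cleaner and easier fashion).

I am not going to repeat his points in my answer, so I will just add my two cents:

Do not use contextualized word embeddings such as the currently available state of the art (e.g. BERT https://arxiv.org/pdf/1810.04805.pdf)
Check how many of your concepts have no representation (e.g. are represented as a vector of zeros). This should be checked (and it is checked in my code; it is discussed further when the time comes), and you may pick the embedding in which most of your concepts are present.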
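The coverage check from the second point can be sketched as follows, assuming the concept embeddings have already been stacked into a NumPy array with one row per concept. The function name and the toy vectors are made up for illustration.

```python
import numpy as np


def coverage(vectors: np.ndarray) -> float:
    """Fraction of rows that are not all-zero, i.e. concepts with a representation."""
    has_vector = np.any(vectors != 0, axis=1)
    return float(has_vector.mean())


# Hypothetical 3-dimensional embeddings; the second concept is out of vocabulary.
toy_vectors = np.array(
    [
        [0.2, -0.1, 0.4],
        [0.0, 0.0, 0.0],
        [0.3, 0.3, -0.2],
        [0.1, 0.0, 0.0],
    ]
)

print(coverage(toy_vectors))  # 0.75
```

Running this once per candidate embedding gives a simple number to compare them by.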

Measuring similarity using spaCy

This class measures the similarity between medicine, encoded as spaCy's GloVe word vector, and every other concept.

class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)

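This code returns one number per concept, measuring how similar it is to the centroid. Furthermore, it records the indices of concepts that are missing a representation. It might be called like this: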

import json
import typing

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")

centroid = nlp("medicine")

concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)

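You may substitute your own data in place of concepts_new.txt.

Look at spaCy's model loading documentation https://spacy.io/usage/models and notice that I have used en_vectors_web_lg https://spacy.io/models/en#en_vectors_web_lg. It consists of 685,000 unique word vectors (which is a lot) and may work out of the box for your case. You have to download it separately after installing spaCy; more information is provided under the links above. Both this package and the n_threads argument used above belong to the spaCy 2.x API.

Additionally, you may want to use multiple centroid words, e.g. add words like disease or health and average their word vectors. I am not sure whether that would improve results in your case, though.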
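Here is a rough sketch of the averaged-centroid idea using plain NumPy arrays instead of spaCy objects. The 3-dimensional vectors and the helper names are invented for illustration; with spaCy the rows would come from the word vectors of the seed words.

```python
import numpy as np


def mean_centroid(word_vectors: np.ndarray) -> np.ndarray:
    """Average several seed word vectors (one per row) into a single centroid."""
    return word_vectors.mean(axis=0)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical vectors for the seed words "medicine", "disease" and "health".
seed_vectors = np.array(
    [
        [0.9, 0.1, 0.0],
        [0.7, 0.3, 0.1],
        [0.8, 0.0, 0.2],
    ]
)
centroid = mean_centroid(seed_vectors)

# Hypothetical concept vectors.
concept_vectors = {
    "bisoprolol": np.array([0.6, 0.2, 0.1]),
    "coffee table": np.array([0.0, 0.1, 0.9]),
}
for name, vector in concept_vectors.items():
    print(name, round(cosine(centroid, vector), 3))
```

Whether the averaged centroid beats the single word medicine is an empirical question; comparing both against a handful of labeled concepts would tell.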

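Another possibility might be to use multiple centroids and calculate the similarity between each concept and each of the centroids. We may have a few thresholds in such a case. This is likely to remove some false positives https://en.wikipedia.org/wiki/False_positives_and_false_negatives, but it may miss some terms which one could consider similar to medicine. Furthermore, it would complicate the case much more. If your results are unsatisfactory, you should consider the two options above (and only then; do not jump into this approach without prior thought).

Now we have a rough measure of a concept's similarity. But what does it mean that a certain concept has a positive similarity of 0.1 to medicine? Is it a concept one should classify as medical? Or is that already too far away?

Asking the expert

To get a threshold (terms below it will be considered non-medical), it is easiest to ask a human to classify some of the concepts for us (and that is what active learning is about). Yes, I know it is a really simple form of active learning, but I would consider it such anyway.

I have written a class with an sklearn-like interface that asks a human to classify concepts until the optimal threshold (or the maximum number of iterations) is reached.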

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        max_steps: int,
        samples: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.max_steps: int = max_steps
        self.samples: int = samples
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

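The samples argument describes how many examples will be shown to the expert during each iteration (this is a maximum; fewer are returned if some were already asked about or there are not enough of them to show).
step is the drop of the threshold in each iteration (we start at 1, meaning perfect similarity).
change_multiplier: if the expert answers that the concepts are not related (or mostly unrelated, since several of them are returned), step is multiplied by this floating point number. It is used to pinpoint the exact threshold between step changes at each iteration.
Concepts are sorted by their similarity (the more similar a concept is, the higher it appears).

The function below asks the expert for an opinion and finds the optimal threshold based on the answers.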

def _ask_expert(self, available_concepts_indices):
    # Get random concepts (the ones above the threshold)
    concepts_to_show = set(
        np.random.choice(
            available_concepts_indices, len(available_concepts_indices)
        ).tolist()
    )
    # Remove those already presented to an expert
    concepts_to_show = concepts_to_show - self._checked_concepts
    self._checked_concepts.update(concepts_to_show)
    # Print message for an expert and concepts to be classified
    if concepts_to_show:
        print("\nAre those concepts related to medicine?\n")
        print(
            "\n".join(
                f"{i}. {concept}"
                for i, concept in enumerate(
                    self.concepts[list(concepts_to_show)[: self.samples]]
                )
            ),
            "\n",
        )
        return input("[y]es / [n]o / [any]quit ")
    return "y"

An example question looks like this:

Are those concepts related to medicine?

0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system

[y]es / [n]o / [any]quit y

... and the expert's answer is then parsed:

# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
    if decision.lower() == "y":
        # You can't go higher as current threshold is related to medicine
        self._max_threshold = self.threshold_
        if self.threshold_ - self.step < self._min_threshold:
            return False
        # Lower the threshold
        self.threshold_ -= self.step
        return True
    if decision.lower() == "n":
        # You can't go lower than this, as current threshold is not related to medicine already
        self._min_threshold = self.threshold_
        # Shrink the step to pinpoint exact spot
        self.step *= self.change_multiplier
        if self.threshold_ + self.step < self._max_threshold:
            return False
        # Raise the threshold
        self.threshold_ += self.step
        return True
    return False

And finally, the whole code of ActiveLearner, which finds the optimal similarity threshold according to the expert:

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as current threshold is not related to medicine already
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self

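All in all, you would have to answer some questions manually, but this approach is way more accurate in my opinion.

Furthermore, you do not have to go through all of the samples, just a small subset of them. You may decide how many samples constitute a medical batch (if 40 medical and 10 non-medical samples are shown, should the batch still be considered medical?), which lets you fine-tune this approach to your preferences. If there is an outlier (say, 1 sample out of 50 is non-medical), I would consider the threshold to still be valid.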
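If you prefer to label each shown concept individually and derive the y/n answer from that, a helper along these lines could do it. This is a sketch only: the function name and the 0.8 default ratio are assumptions, not part of the ActiveLearner class.

```python
from typing import Sequence


def batch_decision(labels: Sequence[bool], min_medical_ratio: float = 0.8) -> str:
    """Turn per-concept expert labels into the single 'y'/'n' answer for a batch."""
    if not labels:
        raise ValueError("labels must not be empty")
    ratio = sum(labels) / len(labels)
    return "y" if ratio >= min_medical_ratio else "n"


print(batch_decision([True] * 40 + [False] * 10))  # 40 of 50 medical -> y
print(batch_decision([True] * 49 + [False]))       # a single outlier -> y
print(batch_decision([True] * 20 + [False] * 30))  # mostly non-medical -> n
```

Raising `min_medical_ratio` makes the resulting threshold stricter, lowering it makes it more permissive.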

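Once again: this approach should be mixed with others in order to minimize the chance of a wrong classification.

Classifier

Once we obtain the threshold from the expert, classification is instantaneous. Here is a simple class for classification: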

class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions

And for brevity, here is the final source code:

import json
import typing

import numpy as np
import spacy


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)


class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as current threshold is not related to medicine already
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")

    centroid = nlp("medicine")

    concepts = json.load(open("concepts_new.txt"))
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )

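After answering some questions, with a threshold of 0.1 (everything in [-1, 0.1) is considered non-medical, while [0.1, 1] is considered medical), I got the following results: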

kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True

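As you can see, this approach is far from perfect, so the last section describes possible improvements:

Possible improvements

As mentioned at the beginning, mixing my approach with the other answers would probably rule out cases like sport shoe being assigned to medicine, and the active learning approach would serve more as the deciding vote in case of a draw between the two heuristics mentioned above.

We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use several of them (either increasing or decreasing); let's say those are 0.1, 0.2, 0.3, 0.4, 0.5.

Let's say sport shoe gets, for each threshold, its respective True/False like this:

True True False False False,

Performing majority voting, we would mark it non-medical by 3 votes to 2. Furthermore, a too strict threshold would be mitigated as well if the thresholds below it out-vote it (in case the True/False values looked like this: True True True False False).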
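The threshold ensemble can be sketched in a few lines. The function name and the similarity scores below are made up for illustration; only the threshold values 0.1 to 0.5 come from the example above.

```python
from typing import Sequence


def ensemble_predict(similarity: float, thresholds: Sequence[float]) -> bool:
    """Majority vote over one 'similarity > threshold' test per threshold."""
    votes = [similarity > threshold for threshold in thresholds]
    return sum(votes) * 2 > len(votes)


THRESHOLDS = (0.1, 0.2, 0.3, 0.4, 0.5)

# Hypothetical similarity scores.
print(ensemble_predict(0.25, THRESHOLDS))  # votes T T F F F -> False
print(ensemble_predict(0.35, THRESHOLDS))  # votes T T T F F -> True
```

Note that with a single similarity score and an odd number of thresholds, this vote amounts to comparing against the median threshold (0.3 here). The ensemble becomes more interesting when each threshold comes from a separate expert session or a separate centroid.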

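The final possible improvement I came up with: in the code above I am using the Doc vector, which is the mean of the word vectors making up the concept. Say one word is missing (its vector consists of zeros); in such a case, the concept would be pushed further away from the medicine centroid. You may not want that (as some niche medical terms [abbreviations like gpv or others] might be missing their representation); in such a case you could average only those vectors which are different from zero.

I know this post is quite lengthy, so if you have any questions, post them below.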
