Does anyone have a categorized XML corpus reader for NLTK?

2024-01-26

Has anyone written a categorized XML corpus reader for NLTK?

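I'm working with the Annotated New York Times corpus, which is an XML corpus. I can read the files with XMLCorpusReader (https://stackoverflow.com/questions/6837566/can-nltks-xmlcorpusreader-be-used-on-a-multi-file-corpus), but I'd like to use some of NLTK's category functionality. There is a nice tutorial (https://www.packtpub.com/article/python-text-processing-nltk-20-creating-custom-corpora) on subclassing NLTK readers. I can go ahead and write this myself, but I was hoping to save some time if someone has already done it.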

If not, I'll post what I write.


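Here is a categorized XML corpus reader for NLTK. It is based on this tutorial: https://www.packtpub.com/article/python-text-processing-nltk-20-creating-custom-corpora. It lets you use NLTK's category-based features on XML corpora such as the New York Times Annotated Corpus.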

Name this file CategorizedXMLCorpusReader.py and import it as:

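import importlib.util
spec = importlib.util.spec_from_file_location('CategorizedXMLCorpusReader', 'PATH_TO_THIS_FILE/CategorizedXMLCorpusReader.py')
CatXMLReader = importlib.util.module_from_spec(spec)
spec.loader.exec_module(CatXMLReader)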

Then you can use it like any other NLTK reader. For example,

reader = CatXMLReader.CategorizedXMLCorpusReader('.../nltk_data/corpora/nytimes', file_ids, cat_file='PATH_TO_CATEGORIES_FILE')
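The cat_file argument in the call above names a plain-text file that maps each file id to its categories, one file per line, with the file id first and the categories after it, separated by whitespace. If the labels live inside the XML itself, a rough sketch for generating those lines could look like the following. The function name build_category_lines and the element name "category" are hypothetical; substitute whatever element carries the labels in your corpus.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def build_category_lines(corpus_root, tag="category"):
    """Return 'fileid cat1 cat2 ...' lines for every .xml file under corpus_root.

    `tag` is a hypothetical element name, not something taken from the
    New York Times schema.
    """
    root = Path(corpus_root)
    lines = []
    for path in sorted(root.rglob("*.xml")):
        labels = []
        for elt in ET.parse(path).getroot().iter(tag):
            # The file is whitespace-delimited, so collapse any whitespace
            # inside a label into underscores.
            label = "_".join((elt.text or "").split())
            if label and label not in labels:
                labels.append(label)
        if labels:
            # File ids are paths relative to the corpus root, with forward slashes.
            fileid = path.relative_to(root).as_posix()
            lines.append(" ".join([fileid] + labels))
    return lines
```

Writing the result out with something like "\n".join(lines) gives a file that can be passed as cat_file. Files without any label are skipped here, which is a design choice of this sketch rather than anything NLTK requires.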

I'm still figuring NLTK out, so any corrections or suggestions are welcome.

# Categorized XML Corpus Reader

from nltk.corpus.reader import CategorizedCorpusReader, XMLCorpusReader
from nltk.tokenize import PunktSentenceTokenizer


class CategorizedXMLCorpusReader(CategorizedCorpusReader, XMLCorpusReader):
    def __init__(self, *args, **kwargs):
        # CategorizedCorpusReader takes the kwargs dict itself and removes the
        # category arguments (e.g. cat_file) before XMLCorpusReader sees them.
        CategorizedCorpusReader.__init__(self, kwargs)
        XMLCorpusReader.__init__(self, *args, **kwargs)

    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        else:
            return fileids

    def _resolve_list(self, fileids, categories):
        # Like _resolve(), but always returns a list so callers can loop over it:
        # no selection means every file, and a single fileid string becomes a
        # one-item list.
        fileids = self._resolve(fileids, categories)
        if fileids is None:
            return self.fileids()
        if isinstance(fileids, str):
            return [fileids]
        return fileids

    # The following methods call the corresponding XMLCorpusReader method
    # with the value returned from _resolve(). We'll start with the plain text methods.
    def raw(self, fileids=None, categories=None):
        return XMLCorpusReader.raw(self, self._resolve(fileids, categories))

    def words(self, fileids=None, categories=None):
        words = []
        # XMLCorpusReader.words works on one file at a time. Concatenate them here.
        for fileid in self._resolve_list(fileids, categories):
            words += XMLCorpusReader.words(self, fileid)
        return words

    # This returns a string of the text of the XML docs without any markup
    def text(self, fileids=None, categories=None):
        pieces = []
        for fileid in self._resolve_list(fileids, categories):
            # Element.iter() replaces getiterator(), which was removed in Python 3.9
            for elt in self.xml(fileid).iter():
                if elt.text:
                    pieces.append(elt.text)
        return "".join(pieces)

    # This returns all text for a specified xml field
    def fieldtext(self, fileids=None, categories=None):
        # NEEDS TO BE WRITTEN
        return

    def sents(self, fileids=None, categories=None):
        # Punkt expects a string, so tokenize the markup-free text rather than the word list.
        text = self.text(fileids, categories)
        return PunktSentenceTokenizer().tokenize(text)

    def paras(self, fileids=None, categories=None):
        # XMLCorpusReader has no paras() to delegate to, and paragraph markup
        # depends on the corpus schema, so this is left unimplemented.
        raise NotImplementedError('paras() is not defined for generic XML corpora')
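
The fieldtext method above is still a stub. As one possible starting point, here is a minimal sketch of the core step, pulling the text of every element with a given tag out of a parsed document using only the standard library. The helper name field_text and its field parameter are assumptions of this sketch; the stub in the class does not take a field name yet, so its signature would need to grow one.

```python
import xml.etree.ElementTree as ET


def field_text(root, field):
    """Join the text of every <field> element found under an ElementTree root.

    itertext() also picks up text nested inside child elements of the field.
    """
    pieces = []
    for elt in root.iter(field):
        pieces.append("".join(elt.itertext()))
    return "\n".join(pieces)


sample = ET.fromstring(
    "<doc><title>Hello</title><p>First <b>bold</b> para.</p><p>Second.</p></doc>"
)
print(field_text(sample, "p"))
```

Inside the reader, the same helper could presumably be applied to self.xml(fileid) for each resolved file id. Note that if a field can be nested inside another element with the same tag, its text would be collected twice by this approach.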