How do I download NLTK data?

2024-05-05

Updated answer: NLTK works with 2.7. I had 3.2. I uninstalled 3.2 and installed 2.7. Now it works!!

I have installed NLTK and tried to download the NLTK data. I followed the instructions on this site: http://www.nltk.org/data.html

I downloaded NLTK, installed it, and then tried to run the following code:

>>> import nltk
>>> nltk.download()

It gave me the following error message:

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    nltk.download()
AttributeError: 'module' object has no attribute 'download'
 Directory of C:\Python32\Lib\site-packages

I tried both nltk.download() and nltk.downloader(); both gave me error messages.

Then I used help(nltk) to pull up the package information, which shows the following:

NAME
    nltk

PACKAGE CONTENTS
    align
    app (package)
    book
    ccg (package)
    chat (package)
    chunk (package)
    classify (package)
    cluster (package)
    collocations
    corpus (package)
    data
    decorators
    downloader
    draw (package)
    examples (package)
    featstruct
    grammar
    help
    inference (package)
    internals
    lazyimport
    metrics (package)
    misc (package)
    model (package)
    parse (package)
    probability
    sem (package)
    sourcedstring
    stem (package)
    tag (package)
    test (package)
    text
    tokenize (package)
    toolbox
    tree
    treetransforms
    util
    yamltags

FILE
    c:\python32\lib\site-packages\nltk

I do see downloader there, so I'm not sure why it isn't working. Python 3.2.2, on Windows Vista.

Answer recommended by NLP Collective

TL;DR

To download a particular dataset/model, use the nltk.download() function. For example, to download the punkt sentence tokenizer, use:

$ python3
>>> import nltk
>>> nltk.download('punkt')
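
For scripts that may run where the resource is already installed, it can help to check first and download only when needed. The following is a minimal sketch, not part of the original answer: `ensure_resource` is a hypothetical helper, and its `find` and `download` parameters exist only so the logic can be exercised without NLTK or a network connection. When they are omitted, it assumes `nltk.data.find` (which raises `LookupError` for a missing resource) and `nltk.download`.

```python
def ensure_resource(resource_path, package_id, find=None, download=None):
    """Download package_id only if resource_path cannot be found locally.

    Returns True if a download was triggered, False if the resource
    was already available.
    """
    if find is None or download is None:
        # Imported lazily so the helper can be tested with stand-in callables.
        import nltk
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)      # raises LookupError when the resource is missing
        return False
    except LookupError:
        download(package_id)     # e.g. nltk.download('punkt')
        return True
```

With NLTK installed, a call such as `ensure_resource('tokenizers/punkt', 'punkt')` would look up the tokenizer and fetch it only if the lookup fails.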

If you're unsure which data/model you need, you can start with the basic list of data + models:

>>> import nltk
>>> nltk.download('popular')

It will download a list of "popular" resources, which includes:

<collection id="popular" name="Popular packages">
      <item ref="cmudict" />
      <item ref="gazetteers" />
      <item ref="genesis" />
      <item ref="gutenberg" />
      <item ref="inaugural" />
      <item ref="movie_reviews" />
      <item ref="names" />
      <item ref="shakespeare" />
      <item ref="stopwords" />
      <item ref="treebank" />
      <item ref="twitter_samples" />
      <item ref="omw" />
      <item ref="wordnet" />
      <item ref="wordnet_ic" />
      <item ref="words" />
      <item ref="maxent_ne_chunker" />
      <item ref="punkt" />
      <item ref="snowball_data" />
      <item ref="averaged_perceptron_tagger" />
    </collection>
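
To work with this list programmatically, for instance to download the packages one at a time, the collection can be parsed with the standard library. This is a rough sketch that uses the snapshot above as input; `collection_items` is a hypothetical helper, and the index NLTK actually serves may differ from this snapshot.

```python
import xml.etree.ElementTree as ET

# The "popular" collection as listed in this answer (a snapshot, not the live index).
POPULAR_XML = """
<collection id="popular" name="Popular packages">
  <item ref="cmudict" />
  <item ref="gazetteers" />
  <item ref="genesis" />
  <item ref="gutenberg" />
  <item ref="inaugural" />
  <item ref="movie_reviews" />
  <item ref="names" />
  <item ref="shakespeare" />
  <item ref="stopwords" />
  <item ref="treebank" />
  <item ref="twitter_samples" />
  <item ref="omw" />
  <item ref="wordnet" />
  <item ref="wordnet_ic" />
  <item ref="words" />
  <item ref="maxent_ne_chunker" />
  <item ref="punkt" />
  <item ref="snowball_data" />
  <item ref="averaged_perceptron_tagger" />
</collection>
"""


def collection_items(xml_text):
    """Return the package ids referenced by a <collection> element."""
    root = ET.fromstring(xml_text.strip())
    return [item.get("ref") for item in root.iter("item")]


popular_ids = collection_items(POPULAR_XML)
print(len(popular_ids))  # 19 packages in this snapshot
```

Each id in `popular_ids` is a string that can be passed to `nltk.download()` individually.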

EDITED

In case anyone wants to avoid errors when downloading larger datasets from nltk, from https://stackoverflow.com/a/38135306/610569:

$ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
$ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
$ python

>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index into treating panlex_lite as if it's already installed.
>>> dler.download('popular')
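
The two `rm` commands above assume a Unix-like shell and an nltk_data directory under /Users/<your_username>. A platform-independent sketch of the same cleanup step is shown below. `remove_panlex_lite` is a hypothetical helper name, and the nltk_data location is passed in rather than guessed.

```python
import shutil
from pathlib import Path


def remove_panlex_lite(nltk_data_dir):
    """Delete corpora/panlex_lite.zip and corpora/panlex_lite/ if they exist.

    Returns the list of paths that were actually removed.
    """
    corpora = Path(nltk_data_dir) / "corpora"
    removed = []

    archive = corpora / "panlex_lite.zip"
    if archive.is_file():
        archive.unlink()            # same effect as: rm panlex_lite.zip
        removed.append(archive)

    extracted = corpora / "panlex_lite"
    if extracted.is_dir():
        shutil.rmtree(extracted)    # same effect as: rm -r panlex_lite
        removed.append(extracted)

    return removed
```

If the data lives in the home directory, `remove_panlex_lite(Path.home() / "nltk_data")` would be the equivalent call; it is safe to run when neither path exists.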

Updated

As of v3.2.5, NLTK has a more informative error message (https://github.com/nltk/nltk/pull/1806) when an nltk_data resource is not found, e.g.:

>>> from nltk import word_tokenize
>>> word_tokenize('x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load
    opened_resource = _open(resource_url)
  File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open
    return find(path_, path + ['']).open()
  File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Searched in:
    - '/Users/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

To find the nltk_data directory (auto-magically), see https://stackoverflow.com/a/36383314/610569
To download nltk_data to a different path, see https://stackoverflow.com/a/48634212/610569
To configure the nltk_data path (i.e. set a different path for NLTK to look for nltk_data), see https://stackoverflow.com/a/22987374/610569
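
As a rough illustration of the last two points, the sketch below registers a custom directory as an nltk_data location. `use_custom_nltk_data` is a hypothetical helper; it relies on the `NLTK_DATA` environment variable, which NLTK reads when it is imported, and on `nltk.data.path`, the list of directories NLTK searches. The linked answers remain the fuller reference.

```python
import os
from pathlib import Path


def use_custom_nltk_data(directory):
    """Create `directory` and register it as an nltk_data search location.

    Returns the absolute path as a string.
    """
    path = str(Path(directory).expanduser().resolve())
    os.makedirs(path, exist_ok=True)

    # Picked up by NLTK at import time, so set it before the first `import nltk`.
    os.environ["NLTK_DATA"] = path

    try:
        import nltk
    except ImportError:
        return path  # NLTK is not installed; the environment variable is still set

    # If NLTK was already imported elsewhere, extend the live search path as well.
    if path not in nltk.data.path:
        nltk.data.path.append(path)
    return path
```

The returned path can then be given to the downloader, for example `nltk.download('punkt', download_dir=use_custom_nltk_data('~/my_nltk_data'))`, where `~/my_nltk_data` is only a placeholder location.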
