TL;DR
To download a particular dataset/model, use the nltk.download() function, e.g. if you are looking to download the punkt sentence tokenizer, use:
$ python3
>>> import nltk
>>> nltk.download('punkt')
If you're unsure which data/model you need, you can start with the basic list of data + models:
>>> import nltk
>>> nltk.download('popular')
It will download a list of "popular" resources, which includes:
<collection id="popular" name="Popular packages">
<item ref="cmudict" />
<item ref="gazetteers" />
<item ref="genesis" />
<item ref="gutenberg" />
<item ref="inaugural" />
<item ref="movie_reviews" />
<item ref="names" />
<item ref="shakespeare" />
<item ref="stopwords" />
<item ref="treebank" />
<item ref="twitter_samples" />
<item ref="omw" />
<item ref="wordnet" />
<item ref="wordnet_ic" />
<item ref="words" />
<item ref="maxent_ne_chunker" />
<item ref="punkt" />
<item ref="snowball_data" />
<item ref="averaged_perceptron_tagger" />
</collection>
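Each ref in that listing is a package identifier of the same kind as 'punkt' in the first example, so an individual entry can be fetched on its own with nltk.download(). As a rough sketch, assuming the collection XML has the shape shown above, the members can be read out with the standard library. The helper name collection_members is made up for illustration, and the XML here is an abbreviated copy with only three items:

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the collection listed above (only three items kept).
COLLECTION_XML = """
<collection id="popular" name="Popular packages">
  <item ref="punkt" />
  <item ref="stopwords" />
  <item ref="wordnet" />
</collection>
"""


def collection_members(xml_text):
    """Return (collection id, list of item refs) for one <collection> element.

    Hypothetical helper, not part of NLTK.
    """
    root = ET.fromstring(xml_text.strip())
    refs = [item.get("ref") for item in root.iter("item")]
    return root.get("id"), refs


collection_id, members = collection_members(COLLECTION_XML)
print(collection_id, members)
```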
EDITED
In case anyone wants to avoid errors when downloading larger datasets from nltk, from https://stackoverflow.com/a/38135306/610569:
$ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
$ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
$ python
>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index into treating panlex_lite as if it's already installed.
>>> dler.download('popular')
Updated
From v3.2.5, NLTK has a more informative error message (https://github.com/nltk/nltk/pull/1806) when an nltk_data resource is not found, e.g.:
>>> from nltk import word_tokenize
>>> word_tokenize('x')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load
opened_resource = _open(resource_url)
File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open
return find(path_, path + ['']).open()
File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
Searched in:
- '/Users/alvas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
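Since the missing resource surfaces as a LookupError, one option is to check for it up front and download only when needed. The following is a minimal sketch, not an NLTK feature: ensure_nltk_resource is a hypothetical helper, and its find/download parameters exist only so it can be tried with stand-ins. By default it falls back to nltk.data.find and nltk.download.

```python
def ensure_nltk_resource(resource_path, package_id, find=None, download=None):
    """Download package_id only if resource_path cannot be located.

    Hypothetical helper. Returns True if a download was triggered,
    False if the resource was already available.
    """
    if find is None or download is None:
        # Imported lazily so the helper can be exercised with stand-ins.
        import nltk
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)
        return False
    except LookupError:
        # Same exception type as in the traceback above.
        download(package_id)
        return True


# Typical call, assuming NLTK is installed and the network is reachable:
# ensure_nltk_resource('tokenizers/punkt', 'punkt')
```

The resource path ('tokenizers/punkt') and the package id ('punkt') are passed separately because the lookup path and the download identifier are not the same string.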
Related
To find the nltk_data directory (auto-magically), see https://stackoverflow.com/a/36383314/610569
To download nltk_data to a different path, see https://stackoverflow.com/a/48634212/610569
To configure the nltk_data path (i.e. set a different path for NLTK to look for nltk_data), see https://stackoverflow.com/a/22987374/610569
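For the last two points, a short sketch of the general idea rather than a summary of the linked answers: nltk.data.path is a plain list of directories that NLTK searches, and nltk.download() accepts a download_dir argument. The helper name add_nltk_data_dir and the directory used below are assumptions made for illustration.

```python
import os


def add_nltk_data_dir(data_dir, search_path=None):
    """Put data_dir at the front of NLTK's search path if it is not there yet.

    Hypothetical helper. When search_path is omitted, nltk.data.path is
    modified in place.
    """
    if search_path is None:
        # Imported lazily so the helper can be tried on an ordinary list.
        import nltk
        search_path = nltk.data.path
    data_dir = os.path.abspath(os.path.expanduser(data_dir))
    if data_dir not in search_path:
        search_path.insert(0, data_dir)
    return search_path


# Typical use, assuming NLTK is installed ('~/my_nltk_data' is a made-up location):
# import nltk
# add_nltk_data_dir('~/my_nltk_data')
# nltk.download('punkt', download_dir=os.path.expanduser('~/my_nltk_data'))
```

Setting the NLTK_DATA environment variable before starting Python is another way to add a search location without touching code.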