Detailed description
I am starting out with word embeddings and have found a wealth of information about them. So far, I know I can either train my own word vectors or use previously trained ones, such as Google's or Wikipedia's, which are available for English but are of no use to me, since I am working with texts in Brazilian Portuguese. I therefore went looking for pre-trained word vectors in Portuguese and eventually found Hirosan's list of pretrained word embeddings (http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/), which led me to Kyubyong's wordvectors (https://github.com/Kyubyong/wordvectors), from which I also learned about Rami Al-Rfou's Polyglot (https://sites.google.com/site/rmyeid/projects/polyglot). After downloading both, I have been trying, without success, simply to load the word vectors.
Short introduction
I cannot load pre-trained word vectors; I have tried both wordvectors (https://github.com/Kyubyong/wordvectors) and Polyglot (https://sites.google.com/site/rmyeid/projects/polyglot).
Downloads
- Kyubyong's pre-trained word2vec-format Portuguese word vectors: https://drive.google.com/open?id=0B0ZXk88koS2KRDcwcV9IVWFTeUE;
- Polyglot's pre-trained Portuguese word vectors: https://doc-0g-54-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/c1ch6rdnp89glqmi8g81ev2somslu7cs/1527537600000/10341224892851088318/*/0B5lWReQPSvmGNEh0VTdmSHlHZ1k?e=download;
Loading attempts
Kyubyong's wordvectors (https://github.com/Kyubyong/wordvectors). First attempt: using Gensim, as suggested by Hirosan (http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/):
from gensim.models import KeyedVectors
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
word_vectors = KeyedVectors.load_word2vec_format(kyu_path, binary=True)
which returns the error:
[...]
File "/Users/luisflavio/anaconda3/lib/python3.6/site-packages/gensim/utils.py", line 359, in any2unicode
return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The downloaded zip also contains other files, but they all return similar errors.
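Incidentally, the byte 0x80 mentioned in the error is the opcode that opens a Python pickle stream (protocol 2), which hints that pt.bin may be a gensim-native save rather than a word2vec C-format binary. A minimal sketch of peeking at a file's leading byte to tell the two apart (the file name here is a made-up stand-in, not the actual pt.bin):

```python
import pickle

# Create a small pickle file just to illustrate (stand-in for pt.bin)
with open("sample.bin", "wb") as f:
    pickle.dump({"palavra": [0.1, 0.2]}, f, protocol=2)

# A word2vec C-format binary starts with an ASCII header like b"123456 300\n",
# whereas a pickle (gensim-native save) starts with the byte 0x80.
with open("sample.bin", "rb") as f:
    head = f.read(1)

print(head == b"\x80")  # a pickle stream begins with byte 0x80
```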
Polyglot (https://sites.google.com/site/rmyeid/projects/polyglot). First attempt: following Al-Rfou's instructions (http://nbviewer.jupyter.org/gist/aboSamoor/6046170):
import pickle
import numpy
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
words, embeddings = pickle.load(open(pol_path, 'rb'))
which returns the error:
File "/Users/luisflavio/Desktop/Python/w2v_loading_tries.py", line 14, in <module>
words, embeddings = pickle.load(open(polyglot_path, "rb"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd4 in position 1: ordinal not in range(128)
Second attempt: using Polyglot's word-embedding loading function (https://polyglot.readthedocs.io/en/latest/Embeddings.html).
First, we have to install polyglot via pip:
pip install polyglot
Now we can import it:
from polyglot.mapping import Embedding
pol_path = '.../pre-trained_word_vectors/polyglot/polyglot-pt.pkl'
embeddings = Embedding.load(pol_path)
which returns the error:
File "/Users/luisflavio/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
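A side note on the 'ascii' codec error from the first Polyglot attempt: that error is typical when a pickle written under Python 2 is read under Python 3, and passing an explicit encoding to pickle.load is a common workaround. The sketch below only demonstrates that mechanism on a freshly written pickle; whether it applies to this particular polyglot-pt.pkl file is an assumption:

```python
import pickle

# Simulate a Python-2-era pickle (protocol 2 was Python 2's highest protocol)
data = (["palavra"], [[0.1, 0.2, 0.3]])
with open("demo.pkl", "wb") as f:
    pickle.dump(data, f, protocol=2)

# Under Python 3, pickles produced by Python 2 often need encoding='latin1',
# which maps old 8-bit str bytes one-to-one onto unicode code points.
with open("demo.pkl", "rb") as f:
    words, embeddings = pickle.load(f, encoding="latin1")

print(words[0])  # -> palavra
```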
Extra information
I am using Python 3 on macOS High Sierra.
Solution
Kyubyong's wordvectors (https://github.com/Kyubyong/wordvectors): as pointed out by Aneesh Joshi (https://stackoverflow.com/a/50579950?noredirect=1), the correct way to load Kyubyong's model is with Word2Vec's native load function:
from gensim.models import Word2Vec
kyu_path = '.../pre-trained_word_vectors/kyubyong_pt/pt.bin'
model = Word2Vec.load(kyu_path)
Although I am very grateful for Aneesh Joshi's solution, Polyglot seems to be the better-suited model for working with Portuguese. Any thoughts on that?