Deep learning, part 8

2023-12-17

(2) Using pretrained word embeddings

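Sometimes you have so little training data that you cannot learn a task-specific word embedding from it alone. In that case, instead of learning word embeddings jointly with the problem you want to solve, you can load embedding vectors from a precomputed embedding space, one that is highly structured and has useful properties because it captures generic aspects of language structure. The rationale for using pretrained word embeddings in natural language processing is the same as for using pretrained convolutional networks in image classification: you do not have enough data to learn truly powerful features yourself, but the features you need are expected to be fairly generic, such as common visual or semantic features. In that situation it makes sense to reuse features learned on a different problem.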

3. Putting it all together: from raw text to word embeddings

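The model is similar to the one just shown: embed sentences into sequences of vectors, flatten them, and train a Dense layer on top. This time, however, it uses pretrained word embeddings. In addition, instead of using the pre-tokenized IMDB data packaged with Keras, we start from scratch by downloading the raw IMDB text.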

(1) Downloading the raw text of the IMDB data

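First, go to http://mng.bz/0tIo, download the raw IMDB dataset, and uncompress it. Next, collect the individual training reviews into a list of strings, one string per review, and collect the review labels (positive/negative) into a labels list.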

'''Processing the labels of the raw IMDB data'''
import os

imdb_dir = '/home/ubuntu/data/aclImdb'  # change this to your own download location
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # The context manager closes the file even if reading fails
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)
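
The loop above assumes that the archive unpacks into `train/neg` and `train/pos` folders of `.txt` files. As a minimal sketch of that layout, the following builds a tiny stand-in tree in a temporary directory and reads it back. The helper name `load_reviews`, the file name `0_1.txt`, and the two sample reviews are made up for illustration and are not taken from the dataset.

```python
import os
import tempfile


def load_reviews(split_dir):
    """Read every .txt review under split_dir/neg and split_dir/pos.

    Returns (texts, labels) with label 0 for 'neg' and 1 for 'pos'.
    """
    texts, labels = [], []
    for label_type in ['neg', 'pos']:
        dir_name = os.path.join(split_dir, label_type)
        # sorted() makes the order independent of the file system
        for fname in sorted(os.listdir(dir_name)):
            if fname.endswith('.txt'):
                with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                    texts.append(f.read())
                labels.append(0 if label_type == 'neg' else 1)
    return texts, labels


# Build a two-review stand-in for the train folder (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    for label_type, review in [('neg', 'A dull film.'), ('pos', 'A great film.')]:
        os.makedirs(os.path.join(tmp, label_type))
        path = os.path.join(tmp, label_type, '0_1.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write(review)
    demo_texts, demo_labels = load_reviews(tmp)

print(demo_texts, demo_labels)
```

Wrapping the loop in a function also makes it reusable for the test folder later on, since only the directory argument changes.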

(2) Tokenizing the data

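Using the concepts introduced earlier, vectorize the text and split it into a training set and a validation set. Because pretrained word embeddings are meant to be particularly useful when little training data is available (otherwise, task-specific embeddings are likely to outperform them), we add the following twist: the training data is restricted to the first 200 samples. The model therefore has to learn to classify movie reviews after seeing just 200 examples.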

'''Tokenizing the text of the raw IMDB data'''
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100  # Truncate each review to 100 words
training_samples = 200  # Train on 200 samples
validation_samples = 10000  # Validate on 10,000 samples
max_words = 10000  # Consider only the 10,000 most frequent words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Split the data into a training set and a validation set.
# But first, shuffle the data, since the samples start out ordered
# (all negative reviews first, then all positive ones).
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
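
To make the tokenizing and padding steps more concrete, here is a toy pure-Python sketch of the same idea: rank words by frequency, replace words by their rank, then bring every sequence to a fixed length. It is not the Keras implementation (for instance, it skips the punctuation filtering that `Tokenizer` performs), and the helper names `build_word_index`, `to_sequences` and `pad_pre` are hypothetical. It does mirror two behaviours worth knowing: with `num_words` set, only words whose index is below `num_words` are kept, and `pad_sequences` by default pads and truncates at the front (`padding='pre'`, `truncating='pre'`), so a long review keeps its last `maxlen` tokens.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused so that it can serve as the padding value.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping indices >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]


def pad_pre(sequences, maxlen, value=0):
    """Keep the last maxlen items of each sequence and left-pad short ones."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded


toy_texts = ['the movie was good', 'the movie was bad', 'the end']
toy_word_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_word_index, num_words=5)
toy_padded = pad_pre(toy_sequences, maxlen=3)

print(toy_word_index)
print(toy_padded)
```

If the opening of a review matters more than its ending for a given task, `pad_sequences` accepts `truncating='post'` to keep the first `maxlen` tokens instead.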

(3) Downloading the GloVe word embeddings

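Go to https://nlp.stanford.edu/projects/glove and download the precomputed embeddings trained on 2014 English Wikipedia. It is an 822 MB zip file named glove.6B.zip, which includes 100-dimensional embedding vectors for 400,000 words (or non-word tokens). Unzip it.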

(4) Preprocessing the embeddings

Parse the unzipped file (a .txt file) to build an index that maps words (as strings) to their vector representations (as number vectors).

'''Parsing the GloVe word-embeddings file'''
glove_dir = '/home/ubuntu/data/'  # change this to your own download location

embeddings_index = {}
# Each line holds a token followed by its space-separated coefficients
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
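
As a small illustration of the file layout and of what "structured" means for an embedding space, the sketch below parses three made-up 3-dimensional vectors written in the same text format and compares them with cosine similarity. The vectors are invented for the example and are not real GloVe values; `cosine_similarity` is a helper defined here, not a function from the document's code.

```python
import io

import numpy as np

# Three made-up 3-dimensional vectors in the GloVe text layout:
# one token per line, followed by its space-separated coefficients.
toy_glove = io.StringIO(
    "good 0.9 0.1 0.0\n"
    "great 0.8 0.2 0.1\n"
    "terrible -0.7 0.3 0.2\n"
)

toy_index = {}
for line in toy_glove:
    values = line.split()
    toy_index[values[0]] = np.asarray(values[1:], dtype='float32')


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_close = cosine_similarity(toy_index['good'], toy_index['great'])
sim_far = cosine_similarity(toy_index['good'], toy_index['terrible'])
print('good vs great:', sim_close)
print('good vs terrible:', sim_far)
```

The same helper can be applied to entries of `embeddings_index` once the real file has been parsed, for example to check that related words end up closer together than unrelated ones.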

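Next, build an embedding matrix that can be loaded into an Embedding layer. It must be a matrix of shape (max_words, embedding_dim), in which row i holds the embedding_dim-dimensional vector for the word with index i in the word index built during tokenization. Note that index 0 is not supposed to stand for any word or token; it is a placeholder.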

'''Preparing the GloVe word-embeddings matrix'''
embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in the embedding index keep an all-zero row.
            embedding_matrix[i] = embedding_vector
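
The same construction can be traced on a toy vocabulary, which also shows how to count how many vocabulary words actually receive a pretrained vector. Everything in this sketch (the five-word index, the 3-dimensional vectors, the `toy_` names) is invented for illustration.

```python
import numpy as np

toy_max_words = 5
toy_dim = 3
# 1-based word index, as produced by tokenization ('zzyzx' has no vector)
toy_word_index = {'the': 1, 'movie': 2, 'zzyzx': 3, 'good': 4, 'bad': 5}
toy_embeddings = {
    'the': np.array([0.1, 0.2, 0.3], dtype='float32'),
    'movie': np.array([0.4, 0.5, 0.6], dtype='float32'),
    'good': np.array([0.7, 0.8, 0.9], dtype='float32'),
    'bad': np.array([-0.7, -0.8, -0.9], dtype='float32'),
}

toy_matrix = np.zeros((toy_max_words, toy_dim))
covered = 0
for word, i in toy_word_index.items():
    if i >= toy_max_words:
        continue  # beyond the vocabulary cap: the matrix has no row for it
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = vector
        covered += 1

# Row 0 is the padding placeholder, so at most toy_max_words - 1 rows are used.
print('rows filled: %d of %d' % (covered, toy_max_words - 1))
print(toy_matrix)
```

Row 0 (the placeholder) and row 3 (a word missing from the pretrained index) stay all-zero, and the word with index 5 is skipped because it falls outside the cap.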

(5) Defining the model

We use the same model architecture as before.

'''Model definition'''
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
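
As a rough check on what `model.summary()` reports, the parameter counts can be worked out by hand from the hyperparameters used above. This is plain arithmetic, not a Keras call; the variable names ending in `_params` are introduced here for the calculation.

```python
max_words = 10000      # vocabulary size, as above
embedding_dim = 100    # size of each word vector, as above
maxlen = 100           # tokens per review, as above

# Embedding: one embedding_dim-sized row per vocabulary index
embedding_params = max_words * embedding_dim

# Flatten has no weights; it turns (maxlen, embedding_dim) into one vector
flattened_size = maxlen * embedding_dim

# Dense layers: inputs * units weights, plus one bias per unit
dense_32_params = flattened_size * 32 + 32
dense_1_params = 32 * 1 + 1

total_params = embedding_params + dense_32_params + dense_1_params
print('Embedding:', embedding_params)
print('Dense(32):', dense_32_params)
print('Dense(1): ', dense_1_params)
print('Total:    ', total_params)
```

Roughly three quarters of the weights sit in the Embedding layer, which is part of why learning them from only 200 samples is hard.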

(6) Loading the GloVe embeddings into the model

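An Embedding layer has a single weight matrix: a 2D float matrix in which row i is the word vector associated with index i. Simple enough. Load the prepared GloVe matrix into the Embedding layer, the first layer of the model.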

model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

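Note that the code above also freezes the Embedding layer (it sets its trainable attribute to False), following the same rationale as with pretrained convnet features. When part of a model is pretrained (like the Embedding layer) and part is randomly initialized (like the classifier), the pretrained part should not be updated during training, so that it does not lose the information it already holds. The large gradient updates triggered by the randomly initialized layers would disrupt the features that have already been learned.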
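
Because an Embedding layer holds a single 2D weight matrix, its forward pass amounts to a row lookup. The NumPy sketch below imitates that lookup and the subsequent Flatten step on toy shapes; it is an illustration of the idea rather than Keras's actual implementation, and the random weights and the `batch` of indices are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(6, 3))   # toy embedding matrix: 6 indices, 3 dims

# Two padded sequences of length 4 (0 is the padding index)
batch = np.array([[0, 0, 2, 5],
                  [1, 4, 4, 3]])

embedded = weights[batch]                      # row lookup -> shape (2, 4, 3)
flattened = embedded.reshape(len(batch), -1)   # Flatten -> shape (2, 12)

print(embedded.shape, flattened.shape)
```

Freezing the layer means that `weights` is left untouched by gradient updates, so every index keeps mapping to its pretrained row throughout training.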

(7) Training and evaluating the model

Compile and train the model.

'''Training and evaluation'''
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')

'''Plotting the results'''
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

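The model quickly starts overfitting, which is unsurprising given the small number of training samples. Validation accuracy fluctuates widely for the same reason, but it appears to approach 60%.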

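Note that your results may vary. Because there are so few training samples, performance depends heavily on exactly which 200 samples are chosen, and they are chosen at random. If your results are poor, try drawing a different random set of 200 samples as an exercise (in real life, you do not get to choose your training data).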
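
One way to run that exercise reproducibly is to drive the shuffle from an explicit seed, so that each seed names one particular choice of 200 samples. The sketch below is a suggestion under stated assumptions: `sample_split` is a hypothetical helper, and 25,000 is assumed to be the number of reviews in the training folder.

```python
import numpy as np


def sample_split(n_total, n_train, n_val, seed):
    """Return disjoint train and validation index arrays for a given seed."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_total)
    return indices[:n_train], indices[n_train:n_train + n_val]


# Same seed, same samples; a different seed draws a different subset.
train_a, val_a = sample_split(25000, 200, 10000, seed=0)
train_b, val_b = sample_split(25000, 200, 10000, seed=0)
train_c, val_c = sample_split(25000, 200, 10000, seed=1)

print(np.array_equal(train_a, train_b))
print(len(np.intersect1d(train_a, train_c)), 'samples shared between seeds')
```

The returned arrays can be used to index `data` and `labels` directly, for example `data[train_a]` and `labels[train_a]`.
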
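You can also train the same model without loading the pretrained word embeddings and without freezing the embedding layer. In that case, the model learns a task-specific embedding of the input tokens. When plenty of data is available, this approach is generally more powerful than pretrained word embeddings, but this example has only 200 training samples.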

'''Training the same model without pretrained word embeddings'''
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))

Figure 6-7 Training and validation loss without pretrained word embeddings

Figure 6-8 Training and validation accuracy without pretrained word embeddings

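Validation accuracy stalls at just over 50%. In this case, then, pretrained word embeddings outperform embeddings learned jointly with the task. If you increase the number of training samples, this will quickly stop being the case; try it as an exercise. Finally, evaluate the model on the test data. First, the test data has to be tokenized.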

'''Tokenizing the test-set data'''
test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname.endswith('.txt'):
            # The context manager closes the file even if reading fails
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)

'''Evaluating the model on the test set'''
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)

Test accuracy comes out at a startling 56%. Working with only a handful of training samples is hard.
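
For reference, `model.evaluate` returns the loss followed by the compiled metrics, here the accuracy. The sketch below shows how such a binary accuracy can be computed by hand from predicted probabilities, using a 0.5 threshold. The four labels and probabilities are invented, and `binary_accuracy` is a helper written for this example rather than the Keras function.

```python
import numpy as np


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))


toy_labels = [1, 0, 0, 1]
toy_probs = [0.9, 0.2, 0.6, 0.4]   # made-up sigmoid outputs
toy_acc = binary_accuracy(toy_labels, toy_probs)
print('accuracy:', toy_acc)
```

On a test set with as many positive as negative reviews, always predicting one class already scores 50%, which is the baseline against which the 56% should be read.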

Summary

1. Turn raw text into a format that a neural network can process.

2. Use the Embedding layer in a Keras model to learn task-specific token embeddings.

3. Use pretrained word embeddings to get an extra performance boost on small natural language processing problems.
