Using BiLSTM-Attention in NLP

2023-11-12

Understanding BiLSTM-Attention
- What BiLSTM-Attention does
- Implementing BiLSTM-Attention

Understanding BiLSTM-Attention

What BiLSTM-Attention does

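1: Understanding the input space and the output space
In a typical NLP pipeline, the first step is an initial encoding of the input, for example one-hot vectors or pretrained word embeddings. Take the sentence 我开心 ("I am happy"). For a classification model we need to combine its three characters into one representation. We map each character to a vector, which gives a matrix of shape [10,3], where 3 is the number of characters and 10 is the length of each vector. How do we fuse this matrix into a single sentence vector? From a linear-algebra point of view, the [10,3] matrix is three column vectors, so the sentence vector can be written as [10,3]*[3,1]. The [3,1] factor can be read as a weight vector. In a sentiment task, for instance, 开心 ("happy") matters most; if we set the weights to [0,0,1], the product simply extracts the input vector for 开心, and the output is the [10,1] vector of that word. Tasks vary endlessly, however, so we would like the model to learn the mapping from the input vectors to the output vector adaptively instead of fixing the weights by hand. That is the subject of this article.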
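The weighted-sum idea above can be sketched in a few lines. This is only an illustration: the random `word_vectors` matrix and the hand-written `weights` are made-up stand-ins, not values from a trained model.

```python
import torch

# Hypothetical toy embeddings: 3 tokens, each a 10-dimensional column vector.
torch.manual_seed(0)
word_vectors = torch.randn(10, 3)               # shape [10, 3]

# Hand-written weights that select only the third column.
weights = torch.tensor([[0.0], [0.0], [1.0]])   # shape [3, 1]

# Sentence vector = weighted sum of the columns.
sentence_vector = word_vectors @ weights        # shape [10, 1]

print(sentence_vector.shape)                    # torch.Size([10, 1])
```

With weights [0,0,1] the result is exactly the third column; attention replaces these fixed weights with learned ones.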

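2: Why re-represent the input vectors

Can we feed the [10,3] input vectors in directly? Yes, but preferably only if the word vectors have been pretrained. Recall PCA from linear algebra: we often write a vector as out = w1v1 + w2v2 + …. What are v1, v2, v3? They are the principal components computed by PCA, a set of mutually orthogonal directions, and the output vector is expressed in terms of these main feature vectors. Roughly speaking, most encoders do the same kind of job: they re-express the input in terms of features that suit the task better.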
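The decomposition out = w1v1 + w2v2 + … can be checked numerically. The sketch below uses NumPy's SVD on made-up random data (the array `X` is purely illustrative) to obtain the principal directions and then rebuilds one centered sample from them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # 100 hypothetical samples with 3 features
Xc = X - X.mean(axis=0)             # PCA works on centered data

# Rows of Vt are the principal directions v1, v2, v3 (orthonormal).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

x = Xc[0]                           # one centered sample
w = Vt @ x                          # coefficients w_i = <v_i, x>
reconstructed = w @ Vt              # w1*v1 + w2*v2 + w3*v3

print(np.allclose(reconstructed, x))
```

Keeping only the first few terms of the sum gives the usual low-dimensional PCA approximation.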

Implementing BiLSTM-Attention

The code below prepares the corpus; this part is not discussed further.

import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
import torch.utils.data as Data

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Bi-LSTM(Attention) Parameters
batch_size = 4
embedding_dim = 2
n_hidden = 5 # number of hidden units in one cell
num_classes = 2  # 0 or 1

# 3 words sentences (=sequence_length is 3)
sentences = ["i love you", "he loves me", "she likes baseball", "i hate you", "sorry for that", "this is awful"]
labels = [1, 1, 1, 0, 0, 0]  # 1 is good, 0 is not good.

vocab = list(set(" ".join(sentences).split()))
word2idx = {w: i for i, w in enumerate(vocab)}
vocab_size = len(word2idx)

def make_data(sentences, labels):
    # Map every word to its vocabulary index. All sentences have 3 words,
    # so the nested list is rectangular: [num_sentences, 3].
    inputs = [[word2idx[w] for w in sen.split()] for sen in sentences]
    # CrossEntropyLoss expects the targets as class indices of dtype long.
    targets = list(labels)
    return torch.tensor(inputs, dtype=torch.long), torch.tensor(targets, dtype=torch.long)

inputs, targets = make_data(sentences, labels)
dataset = Data.TensorDataset(inputs, targets)
loader = Data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

We focus on the following part:

class BiLSTM_Attention(nn.Module):
    def __init__(self):
        super(BiLSTM_Attention, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, n_hidden, bidirectional=True)
        self.out = nn.Linear(n_hidden * 2, num_classes)

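nn.Embedding converts the input characters or words into input word vectors; what gets trained is a word-embedding table based on our own corpus. self.lstm is the encoding layer: a bidirectional LSTM encodes the input. Then, as described above, we need to turn, say, a [10,3] matrix of vectors into one overall representation of shape [10,1], which we can call the sentence vector. Finally, self.out is the output layer: it simply maps the final sentence vector to the number of classes.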
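As a small illustration of the embedding step, the sketch below looks up a batch of token indices. The vocabulary size of 16 and the index values are assumptions chosen for the example; in the article they come from `word2idx`.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 16, 2          # hypothetical sizes for this sketch
embedding = nn.Embedding(vocab_size, embedding_dim)

# A batch of 4 sentences with 3 token indices each: shape [4, 3].
token_ids = torch.tensor([[0, 1, 2],
                          [3, 4, 5],
                          [6, 7, 8],
                          [9, 10, 11]])

vectors = embedding(token_ids)             # shape [4, 3, 2]
print(vectors.shape)                       # torch.Size([4, 3, 2])
```

Each index simply selects one row of `embedding.weight`, and that weight matrix is a trainable parameter, which is why the word vectors end up fitted to the training corpus.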
Implementation 1: use the final hidden state of the LSTM to compute the attention scores.

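    def attention_net(self, lstm_output, final_state):
        # lstm_output : [batch_size, n_step, n_hidden * num_directions(=2)]
        # final_state : [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        # Concatenate the forward and backward final states per example. A plain
        # view(batch_size, -1, 1) would mix values from different batch elements.
        hidden = torch.cat((final_state[0], final_state[1]), dim=1).unsqueeze(2)  # hidden : [batch_size, n_hidden * 2, 1]
        attn_weights = torch.bmm(lstm_output, hidden).squeeze(2)  # attn_weights : [batch_size, n_step]
        soft_attn_weights = F.softmax(attn_weights, 1)

        # context : [batch_size, n_hidden * num_directions(=2)]
        context = torch.bmm(lstm_output.transpose(1, 2), soft_attn_weights.unsqueeze(2)).squeeze(2)
        return context, soft_attn_weights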

Implementation 2: compute the attention scores directly with a linear layer.

    def __init__(self):
        super(BiLSTM_Attention, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, n_hidden, bidirectional=True)
        # Here a linear layer produces the attention scores directly.
        self.attention = nn.Linear(n_hidden * 2, 1)
        self.out = nn.Linear(n_hidden * 2, num_classes)

    def attention_net(self, lstm_output, final_state):
        # final_state is not used in this variant; the argument is kept so that
        # forward() can call both variants in the same way.
        # Second variant: score every time step with the linear layer.
        # In our experiments this worked better.
        attn_weights = self.attention(lstm_output).squeeze(2)  # attn_weights : [batch_size, n_step]
        soft_attn_weights = F.softmax(attn_weights, 1)

        # context : [batch_size, n_hidden * num_directions(=2)]
        context = torch.bmm(lstm_output.transpose(1, 2), soft_attn_weights.unsqueeze(2)).squeeze(2)
        return context, soft_attn_weights

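This layer is the attention model used here. What does it learn? It learns the [3,1] vector described above: for the three column vectors of the [10,3] matrix, it computes one weight per column. The final context is the result of mapping the input space to the output space.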
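To make the weighting step concrete, here is a standalone sketch of both scoring variants followed by the same softmax pooling. The random `lstm_output` and `query` tensors are stand-ins for the real encoder output and final hidden state, and the helper name `pool` is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, feat_dim = 4, 3, 10
lstm_output = torch.randn(batch_size, seq_len, feat_dim)   # stand-in encoder output

# Variant 1: dot-product score against a query vector (random stand-in here).
query = torch.randn(batch_size, feat_dim, 1)
scores_dot = torch.bmm(lstm_output, query).squeeze(2)      # [4, 3]

# Variant 2: a learned linear layer scores each time step.
scorer = nn.Linear(feat_dim, 1)
scores_lin = scorer(lstm_output).squeeze(2)                # [4, 3]

def pool(encoder_out, scores):
    """Softmax over time steps, then a weighted sum of the encoder outputs."""
    weights = F.softmax(scores, dim=1)                     # [batch, seq_len]
    context = torch.bmm(encoder_out.transpose(1, 2),
                        weights.unsqueeze(2)).squeeze(2)   # [batch, feat_dim]
    return context, weights

context_dot, w_dot = pool(lstm_output, scores_dot)
context_lin, w_lin = pool(lstm_output, scores_lin)
print(context_dot.shape, w_dot.sum(dim=1))
```

In both variants the weights of each sentence sum to 1, so the context vector is a convex combination of the per-step encoder outputs; only the way the raw scores are produced differs.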

With the basic components in place, we can now build the forward pass.

    def forward(self, X):
        '''
        X: [batch_size, seq_len]
        '''
        input = self.embedding(X)  # input : [batch_size, seq_len, embedding_dim]
        input = input.transpose(0, 1)  # input : [seq_len, batch_size, embedding_dim]

        # final_hidden_state, final_cell_state : [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        output, (final_hidden_state, final_cell_state) = self.lstm(input)
        output = output.transpose(0, 1)  # output : [batch_size, seq_len, n_hidden * num_directions(=2)]
        attn_output, attention = self.attention_net(output, final_hidden_state)
        return self.out(attn_output), attention  # model : [batch_size, num_classes], attention : [batch_size, n_step]

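X is the input, with shape [4,3], i.e. [batch_size, seq_len].
self.embedding(X) maps X into the word-vector space, so input becomes [4,3,2].
Because nn.LSTM expects the sequence dimension first by default, batch_size and seq_len are swapped, giving [3,4,2], which is fed into self.lstm. An LSTM is a sequence model and produces an output at every time step. The network here is LSTM(2, 5, bidirectional=True), so the output has shape [3,4,5*2]: since the LSTM is bidirectional, two vectors of size 5 are concatenated, and the output is [3,4,10]. An LSTM carries two states, c and h. We usually keep only the states of the last time step, so c and h both have shape [2,4,5], where the 2 comes from the two directions.
Now for the key part, attention. As described above, the goal is to obtain a weight for each vector of the encoder output. The encoder output is [3,4,10], transposed to [batch_size, seq_len, out_dim], i.e. [4,3,10]. We then need a vector to attend with. There are many ways to design it; here we use the final hidden state [2,4,5], but other choices are possible, for example using nn.Linear directly. The attention weights have shape [4,3,1]. Dropping batch_size, the encoder output is [10,3] and the attention layer yields a weight vector of shape [3,1], so the product [10,3]*[3,1] discussed earlier gives the [10,1] sentence vector. That completes the LSTM-based attention mechanism.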
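The shapes in this walkthrough can be verified with a standalone bidirectional LSTM. The random input `x` is a stand-in for the embedded batch; the sizes match the ones used in the article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch_size, embedding_dim, n_hidden = 3, 4, 2, 5
lstm = nn.LSTM(embedding_dim, n_hidden, bidirectional=True)

x = torch.randn(seq_len, batch_size, embedding_dim)   # [3, 4, 2]
output, (h_n, c_n) = lstm(x)

print(output.shape)   # torch.Size([3, 4, 10])
print(h_n.shape)      # torch.Size([2, 4, 5])
print(c_n.shape)      # torch.Size([2, 4, 5])

# The last dimension of `output` is [forward half | backward half].
forward_last = output[-1, :, :n_hidden]    # forward direction, last time step
backward_last = output[0, :, n_hidden:]    # backward direction, first time step
print(torch.allclose(forward_last, h_n[0]), torch.allclose(backward_last, h_n[1]))
```

The last two lines show how h_n relates to the per-step outputs: h_n[0] is the forward state after the last token and h_n[1] is the backward state after the first token, each of shape [batch_size, n_hidden].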

Key point:

1: Designing how the weight vector is computed. Two implementations are shown here: (1) using the hidden state, and (2) using a linear layer.

from torch.utils.tensorboard import SummaryWriter  # requires the tensorboard package

model = BiLSTM_Attention().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
writer = SummaryWriter()  # TensorBoard logger (assumed setup; writes under ./runs by default)

# Training
for epoch in range(500):
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        pred, attention = model(x)
        loss = criterion(pred, y)
        if (epoch + 1) % 10 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss.item()))
            writer.add_scalar("Loss/train", loss.item(), epoch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
writer.flush()
# Save the model parameters
torch.save(model.state_dict(), "bi-lstmattention-para")
