paddlenlp调用ERNIE、使用ERNIEKIT

2023-11-13

paddle调用ERNIE

安装paddle和paddlenlp

pip安装paddlepaddle和paddlenlp：
版本：
paddle.version: 2.2.2
paddlenlp.version: 2.4.5

pip install paddlepaddle
pip install paddlenlp

（下载）加载ERNIE预训练模型

import paddlenlp 

# 加载ERNIE预训练模型
MODEL_NAME = "ernie-3.0-medium-zh"
ernie_model = paddlenlp.transformers.ErnieModel.from_pretrained(MODEL_NAME)

tokenizer

类似于BERT，vocab中几个特殊符号如下：
[PAD] 0：
[CLS] 1：位于整个输入的首尾
[SEP] 2：位于每个句子的句尾
[MASK] 3
[UNK] ：表示unk_token，其不存在与vocab中，在每个task的配置文件中可以自定义，默认是[UNK]

对文本进行tokenizer处理

model_dir = 'C:\\Users\\cqf\\.paddlenlp\\models\\ernie-3.0-medium-zh' # 上一步下载模型时，默认下载到该地址
tokenizer = paddlenlp.transformers.ErnieTokenizer.from_pretrained(model_dir) # MODEL_NAME，第一次加载（本地没有模型）时可以直接指定模型名，会自动下载保存
encoded_text = tokenizer(text=["冬天来了，春天还会远吗？", '秋天来了，一群大雁往南飞。']) # text参数可以是字符串，也可以是多个字符串组成的列表
print(encoded_text) # 结果：{'input_ids': [[1, 196, 1296, 125, 61, 15, 4, 740, 125, 201, 32, 629, 1114, 12045, 2], [1, 1050, 125, 61, 15, 4, 7, 626, 19, 2665, 638, 219, 706, 12043, 2]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}

结果：
在这里插入图片描述

获取文本语义特征向量表示

# 将tokenizer的input_ids、token_type_ids转化为tensor
input_ids = paddle.to_tensor(encoded_text['input_ids']) # padding后作为输入
token_type_ids = paddle.to_tensor(encoded_text['token_type_ids'])

# 将tensor形式的input_ids、token_type_ids输入到ERNIE模型，得到输出：
# sequence_output：每个输入token的语义特征表示，shape=(batch_size, seq_len, hidden_size)
# pooled_output：整个句子的语义特征表示，shape=(batch_size, hidden_size)
sequence_output, pooled_output = ernie_model(input_ids, token_type_ids)

print("Token wise output: {}, Pooled output: {}".format(
    sequence_output.shape, pooled_output.shape)) # Token wise output: [2, 15, 768], Pooled output: [2, 768]
print(sequence_output, pooled_output)

特征向量结果：
在这里插入图片描述

中文官方说明文档：https://paddlenlp.readthedocs.io/zh/latest/get_started/quick_start.html
文本分类模型（ErnieForSequenceClassification）实操文档：https://aistudio.baidu.com/aistudio/projectdetail/1294333

ERNIEKIT实践

需要已经安装了paddlepaddle、paddlenlp。

安装、配置nltk

① 安装nltk模块（在conda环境中）：

conda install nltk

② 下载nltk数据包：nltk_data
直接在网上手动下载（我的网盘中）：
链接：https://pan.baidu.com/s/1VrWiG3orbQanK917540rxw
提取码：1xpu
解压后（保证文件夹名为nltk_data）保存到：C:\Users[你的用户名]\AppData\Roaming
③ 进入python，下载数据包：

import nltk
from nltk.book import *

结果显示如下，nltk安装成功。
在这里插入图片描述

下载ERNIEKIT源码

GitHub下载地址：https://github.com/PaddlePaddle/ERNIE
也可以使用git命令下载：

git clone https://github.com/PaddlePaddle/ERNIE.git

运行ERNIEKIT

按照GitHub上步骤一步步执行。
注意：（目前测试了 text_classification 和 text_matching 两个任务，text_classification 会报UnicodeDecodeError错误）
（1）下载的模型会保存在：.\ERNIE-ernie-kit-open-v1.0\applications\models_hub
（2）各个task中example文件下是对应的配置文件（_infer后缀的是预测配置文件，不带的为训练配置文件），根据自身设备情况，需要修改对应的 cpu 或者 gpu
如：“PADDLE_PLACE_TYPE”: “cpu”
（3）训练时，读取数据过程会报UnicodeDecodeError错误，需要修改
① 需修改文件：erniekit\data\data_set_reader\basic_dataset_reader.py
修改位置，第120行：with open(file_path, "r+") as f:
修改后代码：with open(file_path, "r+", encoding='utf-8') as f:
② 需修改文件：erniekit\data\vocabulary.py
修改位置，第30行：file_vocab = open(self.vocab_path)
修改后代码：file_vocab = open(self.vocab_path, 'r+', encoding='utf-8')
（4）预测时，配置文件中读取的模型文件路径（“inference_model_path”）需要根据自身情况进行修改：
如：“inference_model_path”: “./output/cls_ernie_3.0_base_fc_ch_dy/save_inference_model/inference_step_251/”

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)