pytorch 使用BART模型进行中文自动摘要

2023-11-18

系列文章

摘要

fine-tune BART模型实现中文自动摘要
如何fine-tune BART模型参见系列文章1
博文提供了数据集和训练好的模型，从结果可以看出，模型学习到了摘要的能力，但是选择适当的位置进行终止，能力较差。

实现

数据准备

首先安装相应的包

! python --version
! pip install datasets transformers rouge-score jieba
! pip install --upgrade ipywidgets tqdm
! jupyter nbextension enable --py widgetsnbextension
! pip install fire

导入包和数据集
数据集见我上传的：

from ipywidgets import IntProgress
import tqdm 
from datasets import load_dataset
dataset = load_dataset('json', data_files='nlpcc_data.json', field='data')

下面对数据进行处理：

dataset["train"]

在这里插入图片描述

def flatten(example):
    return {
        "document": example["content"],
        "summary": example["title"],
        "id":"0"
    }
dataset = dataset["train"].map(flatten, remove_columns=["title", "content"]) # , remove_columns=["title", "content"]

下面加载完整的BART模型，以备fine-tune

TokenModel = "bert-base-chinese"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(TokenModel)

model_checkpoint = "facebook/bart-large-cnn"
if model_checkpoint in ["t5-small", "t5-base", "t5-larg", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = "" # BART-12-3

max_input_length = 1024 # input, source text
max_target_length = 256 # summary, target text

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

dataset

在这里插入图片描述

raw_datasets = dataset
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

装载数据

from datasets import dataset_dict
import datasets

train_data_txt, validation_data_txt = dataset.train_test_split(test_size=0.1).values()
train_data_txt, test_data_tex = train_data_txt.train_test_split(test_size=0.1).values()
# 装载数据
dd = datasets.DatasetDict({"train":train_data_txt,"validation": validation_data_txt,"test":test_data_tex }) 

raw_datasets = dd
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

处理好的数据集

raw_datasets

在这里插入图片描述

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

预览数据

show_random_elements(raw_datasets["train"])

在这里插入图片描述
安装transformer

! pip install transformers

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

  
import warnings
from pathlib import Path
from typing import List, Tuple, Union

import fire
from torch import nn

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, PreTrainedModel
from transformers.utils import logging

logger = logging.get_logger(__name__)

抽取部分模型 fine-tune

def copy_layers(src_layers: nn.ModuleList, dest_layers: nn.ModuleList, layers_to_copy: List[int]) -> None:
    layers_to_copy = nn.ModuleList([src_layers[i] for i in layers_to_copy])
    assert len(dest_layers) == len(layers_to_copy), f"{len(dest_layers)} != {len(layers_to_copy)}"
    dest_layers.load_state_dict(layers_to_copy.state_dict())


LAYERS_TO_COPY = {
    # maps  num layers in teacher -> num_layers in student -> which teacher layers to copy.
    # 12: bart, 16: pegasus, 6: marian/Helsinki-NLP
    12: {
        1: [0],  # This says that if the teacher has 12 layers and the student has 1, copy layer 0 of the teacher
        2: [0, 6],
        3: [0, 6, 11],      # the first, 7th and 12th decode layers
        4: [0, 4, 8, 11],
        6: [0, 2, 4, 7, 9, 11],
        9: [0, 1, 2, 4, 5, 7, 9, 10, 11],
        12: list(range(12)),
    },
    16: {  # maps  num layers in student -> which teacher layers to copy
        1: [0],
        2: [0, 15],
        3: [0, 8, 15], 
        4: [0, 5, 10, 15],
        6: [0, 3, 6, 9, 12, 15],
        8: [0, 2, 4, 6, 8, 10, 12, 15],
        9: [0, 1, 3, 5, 7, 9, 11, 13, 15],
        12: [0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 13, 15],
        16: list(range(16)),
    },
    6: {1: [0], 2: [0, 5], 3: [0, 2, 5], 4: [0, 1, 3, 5], 6: list(range(6))},
}
LAYERS_TO_SUPERVISE = {
    # maps  num layers in student -> which teacher layers to copy.
    6: {1: [5], 2: [3, 5], 3: [1, 4, 5], 4: [1, 2, 4, 5]},
    12: {1: [11], 2: [5, 11], 3: [3, 7, 11], 6: [1, 3, 5, 8, 10, 11]},
    16: {1: [15], 4: [4, 9, 12, 15], 8: [1, 3, 5, 7, 9, 11, 13, 15]},
}

def create_student_by_copying_alternating_layers(
    teacher: Union[str, PreTrainedModel],
    save_path: Union[str, Path] = "student",
    e: Union[int, None] = None,
    d: Union[int, None] = None,
    copy_first_teacher_layers=False,
    e_layers_to_copy=None,
    d_layers_to_copy=None,
    **extra_config_kwargs
) -> Tuple[PreTrainedModel, List[int], List[int]]:
    
    _msg = "encoder_layers and decoder_layers cannot be both None-- you would just have an identical teacher."
    assert (e is not None) or (d is not None), _msg
    if isinstance(teacher, str):
        AutoTokenizer.from_pretrained(teacher).save_pretrained(save_path)  # purely for convenience
        teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher).eval()
    else:

        assert isinstance(teacher, PreTrainedModel), f"teacher must be a model or string got type {type(teacher)}"
    init_kwargs = teacher.config.to_diff_dict()

    try:
        teacher_e, teacher_d = teacher.config.encoder_layers, teacher.config.decoder_layers
        if e is None:
            e = teacher_e
        if d is None:
            d = teacher_d
        init_kwargs.update({"encoder_layers": e, "decoder_layers": d})
    except AttributeError:  # T5
        teacher_e, teacher_d = teacher.config.num_layers, teacher.config.num_decoder_layers
        if e is None:
            e = teacher_e
        if d is None:
            d = teacher_d
        init_kwargs.update({"num_layers": e, "num_decoder_layers": d})

    # Kwargs to instantiate student: teacher kwargs with updated layer numbers + **extra_config_kwargs
    init_kwargs.update(extra_config_kwargs)

    # Copy weights
    student_cfg = teacher.config_class(**init_kwargs)
    student = AutoModelForSeq2SeqLM.from_config(student_cfg)
    # Start by copying the full teacher state dict this will copy the first N teacher layers to the student.
    info = student.load_state_dict(teacher.state_dict(), strict=False)
    assert info.missing_keys == [], info.missing_keys  # every student key should have a teacher keys.

    if copy_first_teacher_layers:  # Our copying is done. We just log and save
        e_layers_to_copy, d_layers_to_copy = list(range(e)), list(range(d))
        logger.info(
            f"Copied encoder layers {e_layers_to_copy} and decoder layers {d_layers_to_copy}. Saving them to {save_path}"
        )
        student.save_pretrained(save_path)
        return student, e_layers_to_copy, d_layers_to_copy

    # Decide which layers of the teacher to copy. Not exactly alternating -- we try to keep first and last layer.
    if e_layers_to_copy is None:
        e_layers_to_copy: List[int] = pick_layers_to_copy(e, teacher_e)
    if d_layers_to_copy is None:
        d_layers_to_copy: List[int] = pick_layers_to_copy(d, teacher_d)

    try:
        copy_layers(teacher.model.encoder.layers, student.model.encoder.layers, e_layers_to_copy)
        copy_layers(teacher.model.decoder.layers, student.model.decoder.layers, d_layers_to_copy)
    except AttributeError:  # For t5, student.model.encoder.layers is called student.encoder.block
        copy_layers(teacher.encoder.block, student.encoder.block, e_layers_to_copy)
        copy_layers(teacher.decoder.block, student.decoder.block, d_layers_to_copy)
    logger.info(
        f"Copied encoder layers {e_layers_to_copy} and decoder layers {d_layers_to_copy}. Saving them to {save_path}"
    )
    student.config.init_metadata = dict(
        teacher_type=teacher.config.model_type,
        copied_encoder_layers=e_layers_to_copy,
        copied_decoder_layers=d_layers_to_copy,
    )
    student.save_pretrained(save_path)
    # Save information about copying for easier reproducibility

    return student, e_layers_to_copy, d_layers_to_copy

def pick_layers_to_copy(n_student, n_teacher):
    try:
        val = LAYERS_TO_COPY[n_teacher][n_student]
        return val
    except KeyError:
        if n_student != n_teacher:
            warnings.warn(
                f"no hardcoded layers to copy for teacher {n_teacher} -> student {n_student}, defaulting to first {n_student}"
            )
        return list(range(n_student))

model, list_en, list_de = create_student_by_copying_alternating_layers(model, 'trian.pth', 12, 3)

batch_size = 2
args = Seq2SeqTrainingArguments(
    output_dir="results",
    num_train_epochs=1,  # demo
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=batch_size,  # demo
    per_device_eval_batch_size=batch_size,
    # learning_rate=3e-05,
    warmup_steps=500,
    weight_decay=0.1,
    label_smoothing_factor=0.1,
    predict_with_generate=True,
    logging_dir="logs",
    logging_steps=50,
    save_total_limit=3,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

import jieba
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(jieba.cut(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(jieba.cut(label.strip())) for label in decoded_labels]
    
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    
    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)
    
    return {k: round(v, 4) for k, v in result.items()}

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

训练与训练平台

开始训练，我是用的华为云，colab的GPU时间不足以完成训练

trainer.train()

在这里插入图片描述

保存和加载模型

torch.save(model.state_dict(), "BART.pth")

import torch
model.load_state_dict(torch.load('BART.pth'))

def generate_summary(test_samples, model):
    inputs = tokenizer(
        test_samples,
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
        return_tensors="pt",
    )
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return outputs, output_str

测试

可以做一个测试：

(结果是一个新闻，有领导人姓名没法放上来，我就不放了，可以看github)

能学习到部分摘要能力，但后面该终止摘要的地方没有自动终止,仍有不足

可供下载的数据集与模型

数据集：nlpcc_data.json
模训练好的模型：BART模型-包含网络参数

Github

https://github.com/downw/summrization

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

NLP

Pytorch

深度学习

pytorch 使用BART模型进行中文自动摘要的相关文章

如何将地名词典或词典表示为 crf++ 中的特征？

如何使用地名词典或词典作为功能CRF https taku910 github io crfpp 详细说明假设我想对人名进行 NER 并且我有一个包含常见人名的地名词典或字典我想使用这个地名词典作为 crf 的输入我该怎么做我正在
使用 Hadoop MapReduce 的计算语言学项目构想

我需要做一个关于计算语言学课程的项目是否有任何有趣的语言问题其数据密集程度足以使用 Hadoop MapReduce 来解决解决方案或算法应尝试分析并提供语言领域的一些见解但是它应该适用于大型数据集以便我可以使用 hado
使用 nltk 进行分块

如何从给定模式的句子中获取所有块示例 NP
PyTorch：如何检查训练期间某些权重是否没有改变？

如何检查 PyTorch 训练期间某些权重是否未更改据我了解一种选择可以是在某些时期转储模型权重并检查它们是否通过迭代权重进行更改但也许有一些更简单的方法有两种方法可以解决这个问题 First for name param in
SGDClassifier 每次为文本分类提供不同的准确度

我使用 SVM 分类器将文本分类为好文本和乱码我正在使用 python 的 scikit learn 并按如下方式执行 Created on May 5 2017 import re import random import numpy
PyTorch 中复数矩阵的行列式

有没有办法在 PyTorch 中计算复矩阵的行列式 torch det未针对 ComplexFloat 实现不幸的是目前尚未实施一种方法是实现您自己的版本或简单地使用np linalg det 这是一个简短的函数它计算我使用 LU
使 CUDA 内存不足

我正在尝试训练网络但我明白了我将批量大小设置为 300 并收到此错误但即使我将其减少到 100 我仍然收到此错误更令人沮丧的是在 1200 个图像上运行 10 epoch 大约需要 40 分钟有什么建议吗错了我怎样才能加快这
如何更新 PyTorch 中神经网络的参数？

假设我想将神经网络的所有参数相乘PyTorch 继承自的类的实例torch nn Module http pytorch org docs master nn html torch nn Module by 0 9 我该怎么做呢 Let n
将 Keras (Tensorflow) 卷积神经网络转换为 PyTorch 卷积网络？

Keras 和 PyTorch 使用不同的参数进行填充 Keras 需要输入字符串而 PyTorch 使用数字有什么区别如何将一个转换为另一个哪些代码在任一框架中获得相同的结果 PyTorch 还采用参数 in channels o
python中的语音识别持续时间设置问题

我有一个 Wav 格式的音频文件我想转录我的代码是 import speech recognition as sr harvard sr AudioFile speech file wav with harvard as source
ANEW 字典可以用于 Quanteda 中的情感分析吗？

我正在尝试找到一种方法来实施英语单词情感规范荷兰语以便使用 Quanteda 进行纵向情感分析我最终想要的是每年的平均情绪以显示任何纵向趋势在数据集中所有单词均由 64 名编码员按照 7 分李克特量表在四个类别上进行评分这提
将复数名词转换为单数名词

如何使用 R 将复数名词转换为单数名词我使用 tagPOS 函数来标记每个文本然后提取所有标记为 NNS 的复数名词但是如果我想将这些复数名词转换为单数该怎么办 library openNLP library tm acq o lt
Pytorch“展开”等价于 Tensorflow [重复]

这个问题在这里已经有答案了假设我有大小为 50 50 的灰度图像在本例中批量大小为 2 并且我使用 Pytorch Unfold 函数如下所示 import numpy as np from torch import nn from
PyTorch 中的交叉熵

交叉熵公式但为什么下面给出loss 0 7437代替loss 0 since 1 log 1 0 import torch import torch nn as nn from torch autograd import Variable
在 Pytorch 中估计高斯模型的混合

我实际上想估计一个以高斯混合作为基本分布的归一化流所以我有点被火炬困住了但是您可以通过估计 torch 中高斯模型的混合来在代码中重现我的错误我的代码如下 import numpy as np import matplotlib p
阻止斯坦福核心 NLP 服务器输出它收到的文本

我正在运行一个斯坦福核心自然语言处理 http stanfordnlp github io CoreNLP server java mx4g cp edu stanford nlp pipeline StanfordCoreNLPServe
保存具有自定义前向功能的 Bert 模型并将其置于 Huggingface 上

我创建了自己的 BertClassifier 模型从预训练开始然后添加由不同层组成的我自己的分类头微调后我想使用 model save pretrained 保存模型但是当我打印它并从预训练上传时我看不到我的分类器头代码如下
如何在Python中使用多处理来加速循环执行

我有两个清单列表 A 包含 500 个单词列表 B 包含 10000 个单词我正在尝试为列表 A 找到与 B 相关的相似单词我正在使用 Spacy 的相似函数我面临的问题是计算需要很长时间我是多处理使用的新手因此请求帮助如何
缩短文本并仅保留重要句子

德国网站 nandoo net 提供了缩短新闻文章的可能性如果使用滑块更改百分比值文本会发生变化并且某些句子会被遗漏您可以在这里看到它的实际效果 http www nandoo net read article 299925 http
如何在 PyTorch 中对子集使用不同的数据增强

如何针对不同的情况使用不同的数据增强转换 Subset在 PyTorch 中吗例如 train test torch utils data random split dataset 80000 2000 train and test将具

随机推荐

Bootstrap插件（六）——警告框（alert.js）

bootstrap中的alert和原本的alert弹框可不太一样原来我们熟悉的弹框是在执行某个动作的时候浏览器和我们弹出来的一个提示框比如下面这样而我们这里的警告框是在html内容之间的提示内容只是他有着醒目的颜色以此来达到提醒
h5标签上实现文字空格

在vue项目中实现文字之间的空格 div class top p class groupLeader 组 xa0 xa0 xa0 长 span xxx span p p class standingGroupLeader 副组长 span
[YOLO专题-23]：YOLO V5 - ultralytics代码解析-网络子结构详解

作者主页文火冰糖的硅基工坊文火冰糖王文兵的博客文火冰糖的硅基工坊 CSDN博客本文网址 https blog csdn net HiWangWenBing article details 122369993 目录第1章网络总
[QT编程系列-2]：C++图形用户界面编程，QT框架快速入门培训 - 1- 预备知识

目录概述 1 前置条件 1 1 C 1 2 图形界面 1 3 图形程序集成开发环境 1 4 图形程序开发框架 1 5 跨平台特性 1 6 QT快速感知 1 6 1 QT的典型应用 1 6 2 QT的特点 1 6 3 QT跨平台集成开发环境
Qt QProcess

目录概述实现一函数接口二执行命令三管道概述本文介绍在Linux环境下使用Qt中的QProcess类执行shell命令并获取输出头文件 include
区块链数字签名详解

有一点比较难以理解的答案就是私钥加密公钥可以解密公钥加密私钥可以解密 RSA的原理两个大质数 p q 乘积 n 难以逆向求解所以pq是对等的公钥和私钥也是对等的区块链从数字货币到信用社会读书笔记这张图来自于新生大学的周兵
【Vue项目实战】Vue3动画神操作！教你如何实现PPT一样的动画效果！

文章目录前言一 Animate css是什么二安装和使用 1 安装 2 基本用法 3 JavaScript用法三动画制作 1 弹入动画总结前言最近写界面的时候发现一个前端组件很好玩他就是鼎鼎大名的 Animate cs
海康工业摄像头调用（linux基于python和opencv）

1 下载官网客户端其中包含SDK 官方网站海康机器人机器视觉下载中心安装deb文件 sudo dpkg i deb文件名 2 运行客户端 cd opt MVS bin MVS sh 如果连不上看看是不是usb3 0的接口 3 调
ThinkPHP 日志信息泄露漏洞复现

ThinkPHP 日志信息泄露漏洞复现漏洞简介 ThinkPHP在开启DEBUG的情况下会在Runtime目录下生成日志而且debug很多网站都没有关 ThinkPHP默认安装后也会在Runtime目录下生成日志 THINKPHP3
基于SSM 和 layui 的增删查改

开发工具 IDEA 2021 WebStorm 2021 Mysql 8 0 开发环境 JDK 8 TomCat 8 5 81 apache maven 3 6 1 技术点 Spring SpringMVC Mybatis Mysql Ht
express使用简介

框架搭建 1 安装脚手架 npm install g express generator 2 创建项目 express myapp 查看项目目录可以知道启动文件www作用是提供http服务 express是一个全栈环境所以有views
维普期刊瑞数5

郑重声明本项目的所有代码和相关文章仅用于经验技术交流分享禁止将相关技术应用到不正当途径因为滥用技术产生的风险与本人无关加载流程 url aHR0cDovL2xpYi5jcXZpcC5jb20vUWlrYW4vU2VhcmNoL0l
TCP/IP详解卷1:协议学习笔记第十六章 BOOTP:引导程序协议

一个无盘系统在不知道自身IP地址情况下进行系统引导时能通过RARP协议获取它的IP地址使用RARP会有两个问题 1 IP地址是返回的唯一结果 2 RARP使用链路层广播 RARP请求不会被路由器转发每个实际网络必须设置一个RARP服务
leetcode算法面试题：打家劫舍问题

题目你是一个专业的小偷计划偷窃沿街的房屋每间房内都藏有一定的现金影响你偷窃的唯一制约因素就是相邻的房屋装有相互连通的防盗系统如果两间相邻的房屋在同一晚上被小偷闯入系统会自动报警给定一个代表每个房屋存放金额的非负整数数组计算你
50个c/c++源代码网站

C C 是最主要的编程语言这里列出了50名优秀网站和网页清单这些网站提供c c 源代码这份清单提供了源代码的链接以及它们的小说明我已尽力包括最佳的C C 源代码的网站这不是一个完整的清单您有建议可以联系我我将欢迎您的建议以
Maven下载、安装和配置教程(2023年6月10日)

Maven下载安装和配置教程 2023年6月10日一下载安装包二安装三配置环境变量系统是win10 四验证是否安装成功五配置文件六 idea里配置一下载安装包链接 https pan baidu com s 1
西部数据出现“WD SES Device USB Device”怎么办，而且说明书全是英文。

您好 wd ses device driver这个驱动程序可以在baidu中输入关键词找到什么驱动之家驱动人生之类的专业驱动网站也都是有的 western digital的移动硬盘驱动程序安装步骤请见下图转载于 https www c
glob.glob in python

reference glob glob in python 功能返回一个某一种文件夹下面的某一类型文件路径列表
Golang网络编程

互联网协议介绍引入 1 物理层 Physical Layer 功能物理层负责定义物理介质传输数据的方式和规范它传输的是原始数据比特流协议 Ethernet Wi Fi USB 光纤等例子将数据通过网线传输的过程类似于我们通过电话线
pytorch 使用BART模型进行中文自动摘要

系列文章如何从大型模型 BART fine tune一个小模型及代码实现文本自动摘要评价方法金字塔方法 pytorch 使用BART模型进行中文自动摘要目录系列文章摘要实现数据准备装载数据预览数据抽取部分模型 fine

pytorch 使用BART模型进行中文自动摘要

系列文章

目录

摘要

实现

数据准备

装载数据

预览数据

抽取部分模型 fine-tune

训练与训练平台

保存和加载模型

测试

可供下载的数据集与模型

Github

pytorch 使用BART模型进行中文自动摘要 的相关文章

随机推荐

热门标签

pytorch 使用BART模型进行中文自动摘要的相关文章