spaCy custom named entity recognition (NER): the "catastrophic forgetting" problem

2024-01-23

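The model cannot remember the labels it was previously trained on. I know this is "catastrophic forgetting", but there don't seem to be any examples or blog posts that help solve it. The most common response is to point to this blog post, https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting, but it is quite old now and did not help.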

Here is my code:

from __future__ import unicode_literals, print_function

import json
import random
import warnings
from pathlib import Path

import plac
import spacy
from spacy.util import minibatch, compounding

# Load the annotation export: one JSON object per line.
labeled_data = []
with open(r"/content/emails_labeled.jsonl", "r") as read_file:
    for line in read_file:
        labeled_data.append(json.loads(line))

# Convert each record to (text, {"entities": [(start, end, label), ...]}).
TRAIN_DATA = []
for entry in labeled_data:
    entities = [(e[0], e[1], e[2]) for e in entry["labels"]]
    TRAIN_DATA.append((entry["text"], {"entities": entities}))


# new entity label
LABEL = "OIL"

# training data
# Note: If you're using an existing model, make sure to mix in examples of
# other entity types that spaCy correctly recognized before. Otherwise, your
# model might learn the new type, but "forget" what it previously knew.
# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
'''
TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]
'''

@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    new_model_name=("New model name for model meta.", "option", "nm", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model='/content/LinkModelOutput', new_model_name="Oil21", output_dir='/content/Last', n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(LABEL)  # add new entity label to entity recognizer
    # Adding extraneous labels shouldn't mess anything up
    #ner.add_label("VEGETABLE")
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    # only train NER
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            batches = minibatch(TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
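                # nlp.update accepts raw texts and annotation dicts; the other
                # pipes are disabled above, so only the NER weights are updated
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)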
            print("Losses", losses)

    # test the trained model
    test_text = "Here is Hindustan petroleum's oil reserves coup in Australia. Details can be found at https://www.textfixer.com/tools/remove-line-breaks.php?"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)


if __name__ == "__main__":
    plac.call(main)

The data was annotated with doccano. Here is a look at the data:

{"id": 174, "text": "service\tmarathon petroleum reduces service postings marathon petroleum co said it reduced the contract price it will pay for all grades of service oil one dlr a barrel effective today the decrease brings marathon s posted price for both west texas intermediate and west texas sour to dlrs a bbl the south louisiana sweet grade of service was reduced to dlrs a bbl the company last changed its service postings on jan reuter", "meta": {}, "annotation_approver": null, "labels": [[61, 70, "OIL"], [147, 150, "OIL"]]}
{"id": 175, "text": "mutual funds\tmunsingwear inc mun th qtr jan loss shr loss cts vs loss seven cts net loss vs loss revs mln vs mln year shr profit cts vs profit cts net profit vs profit revs mln vs mln avg shrs vs note per shr adjusted for for stock split july and for split may reuter", "meta": {}, "annotation_approver": null, "labels": []}
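Before training on records like these, it can help to confirm that the character offsets really point at the intended words and to count how many examples each label has. The following is a minimal sketch, assuming the JSONL layout shown above (`text` plus `labels` as `[start, end, label]` triples); the helper names `load_doccano_jsonl` and `check_spans` and the two sample lines are made up for illustration.

```python
import json
from collections import Counter


def load_doccano_jsonl(lines):
    """Parse JSONL lines into (text, [(start, end, label), ...]) pairs."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        data = json.loads(line)
        spans = [(start, end, label) for start, end, label in data["labels"]]
        records.append((data["text"], spans))
    return records


def check_spans(records):
    """Return the annotated surface strings, per-label counts, and bad spans."""
    surfaces, counts, bad = [], Counter(), []
    for text, spans in records:
        for start, end, label in spans:
            # A span is only usable if it lies inside the text and is non-empty.
            if not (0 <= start < end <= len(text)):
                bad.append((start, end, label))
                continue
            surfaces.append(text[start:end])
            counts[label] += 1
    return surfaces, counts, bad


# Hypothetical sample lines in the same shape as the export above.
sample = [
    '{"id": 1, "text": "marathon petroleum cut the price of service oil", '
    '"labels": [[9, 18, "OIL"], [44, 47, "OIL"]]}',
    '{"id": 2, "text": "net loss vs loss revs mln", "labels": []}',
]

records = load_doccano_jsonl(sample)
surfaces, counts, bad = check_spans(records)
print("annotated strings:", surfaces)
print("examples per label:", dict(counts))
print("out-of-range spans:", bad)
```

Reading the printed strings is a quick way to spot offsets that are shifted by a character or that cut a word in half, which is the kind of problem that otherwise only shows up as misaligned-span warnings during training.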

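I'm not a spaCy expert, but I ran into the same problem. Three things matter: the annotation tool, the amount of training data, and mixing in entities that the model already predicts correctly. First, make sure your training data is labeled correctly by the tool you chose (you should not get any user warnings). To get good predictions, your model needs a lot of data: at least 200 examples for each entity you want to train. Personally, I label as much data as I can. The makers of spaCy recommend mixing in entities that the model already predicts correctly.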
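The "mixing in" advice can be sketched as follows: before fine-tuning, run the original model over the training texts and add its predicted entities to the hand-labeled ones, so that each example carries both the new label and the labels the model already knew. This is only an illustrative sketch, not the author's code. `mix_in_predictions` and `fake_predict` are hypothetical names, and `fake_predict` stands in for a real model so the example runs without spaCy installed.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def mix_in_predictions(train_data, predict_entities):
    """Add model-predicted spans to each example's gold spans.

    train_data: list of (text, {"entities": [(start, end, label), ...]})
    predict_entities: callable mapping a text to (start, end, label) spans,
        e.g. a wrapper around the model as it was *before* fine-tuning.
    Gold spans always win: a predicted span is dropped if it overlaps one.
    """
    mixed = []
    for text, annotations in train_data:
        kept = [tuple(span) for span in annotations.get("entities", [])]
        for span in predict_entities(text):
            span = tuple(span)
            if not any(overlaps(span, other) for other in kept):
                kept.append(span)
        mixed.append((text, {"entities": sorted(kept)}))
    return mixed


def fake_predict(text):
    """Stand-in for the original model: tags the word 'Australia' as GPE."""
    start = text.find("Australia")
    if start < 0:
        return []
    return [(start, start + len("Australia"), "GPE")]


data = [("oil reserves in Australia", {"entities": [(0, 3, "OIL")]})]
mixed = mix_in_predictions(data, fake_predict)
print(mixed)
```

With spaCy, `predict_entities` could be something like `lambda text: [(ent.start_char, ent.end_char, ent.label_) for ent in original_nlp(text).ents]`, where `original_nlp` is an assumed name for a copy of the model loaded before any new training. Dropping predicted spans that overlap gold spans keeps the hand-made annotations authoritative and avoids handing the trainer conflicting entities.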
