Sliding window for long text in BERT for Question Answering

2024-03-17

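I've read posts that explain how the sliding window works, but I cannot find any information on how it is actually implemented.

From what I understand, if the input is too long, a sliding window can be used to process the text.

Please correct me if I am wrong. Say I have the text "In June 2017 Kaggle announced that it passed 1 million registered users".

Given some stride and max_len, the input can be split into chunks with overlapping words (not considering padding).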

In June 2017 Kaggle announced that # chunk 1
announced that it passed 1 million # chunk 2
1 million registered users # chunk 3
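The chunking above can be reproduced with a minimal sketch that works on whitespace-separated words rather than WordPiece tokens. The function name sliding_window_chunks and the parameter step (the distance between the starts of two consecutive windows) are hypothetical names chosen for illustration, not part of any library. With max_len=6 and step=4, consecutive chunks overlap by two words, which matches the three chunks listed above.

```python
def sliding_window_chunks(words, max_len, step):
    """Split a list of words into overlapping windows.

    max_len: maximum number of words per chunk.
    step:    how far the window start moves each time (hypothetical name);
             the overlap between neighbours is max_len - step.
    """
    chunks = []
    start = 0
    while True:
        chunks.append(words[start:start + max_len])
        # Stop once the current window reaches the end of the text.
        if start + max_len >= len(words):
            break
        start += step
    return chunks


text = "In June 2017 Kaggle announced that it passed 1 million registered users"
chunks = sliding_window_chunks(text.split(), max_len=6, step=4)

for number, chunk in enumerate(chunks, start=1):
    print(" ".join(chunk), "# chunk", number)
```

A real pipeline would window over subword tokens and reserve room for the question and the special tokens, so the window budget for the context is smaller than the model's maximum sequence length.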

If my questions were "when did Kaggle make the announcement" and "how many registered users", I can use chunk 1 and chunk 3 and not use chunk 2 at all in the model. I am not sure whether I should still use chunk 2 to train the model.

So the inputs will be: [CLS]when did Kaggle make the announcement[SEP]In June 2017 Kaggle announced that[SEP] and [CLS]how many registered users[SEP]1 million registered users[SEP]


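Then, if I have a question with no answer, do I feed it into the model together with all chunks and set the start and end index to -1? For example, "can pigs fly?"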

[CLS]can pigs fly[SEP]In June 2017 Kaggle announced that[SEP]

[CLS]can pigs fly[SEP]announced that it passed 1 million[SEP]

[CLS]can pigs fly[SEP]1 million registered users[SEP]


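As suggested in the comments, I tried running squad_convert_example_to_features (source code: https://github.com/huggingface/transformers/blob/1af58c07064d8f4580909527a8f18de226b226ee/src/transformers/data/processors/squad.py#L134) to investigate the problem described above, but it doesn't seem to work, and there is no documentation for it. It seems that run_squad.py from Hugging Face uses squad_convert_examples_to_features, with the s in examples.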

from transformers.data.processors.squad import SquadResult, SquadV1Processor, SquadV2Processor, squad_convert_example_to_features
from transformers import AutoTokenizer, AutoConfig, squad_convert_examples_to_features

FILE_DIR = "."

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
processor = SquadV2Processor()
examples = processor.get_train_examples(FILE_DIR)

features = squad_convert_example_to_features(
    example=examples[0],
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
)

I got this error:

100%|██████████| 1/1 [00:00<00:00, 159.95it/s]
Traceback (most recent call last):
  File "<input>", line 25, in <module>
    sub_tokens = tokenizer.tokenize(token)
NameError: name 'tokenizer' is not defined

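The error indicates that there is no tokenizer, but the function does not allow us to pass one in. It does work, though, if I add a tokenizer inside the function while in debug mode. So how exactly do I use the squad_convert_example_to_features function?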


I think the problem is with the example you picked. Both squad_convert_examples_to_features https://github.com/huggingface/transformers/blob/1af58c07064d8f4580909527a8f18de226b226ee/src/transformers/data/processors/squad.py#L273 and squad_convert_example_to_features https://github.com/huggingface/transformers/blob/1af58c07064d8f4580909527a8f18de226b226ee/src/transformers/data/processors/squad.py#L86 implement the sliding window approach, because squad_convert_examples_to_features is just a parallelization wrapper around squad_convert_example_to_features. But let's look at the single-example function. First you need to call squad_convert_example_to_features_init https://github.com/huggingface/transformers/blob/1af58c07064d8f4580909527a8f18de226b226ee/src/transformers/data/processors/squad.py#L268 to make the tokenizer global (squad_convert_examples_to_features does this for you automatically):

from transformers.data.processors.squad import SquadResult, SquadV1Processor, SquadV2Processor, squad_convert_example_to_features, squad_convert_example_to_features_init
from transformers import AutoTokenizer, AutoConfig, squad_convert_examples_to_features

FILE_DIR = "."

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
squad_convert_example_to_features_init(tokenizer)

processor = SquadV2Processor()
examples = processor.get_train_examples(FILE_DIR)

features = squad_convert_example_to_features(
    example=examples[0],
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
)
print(len(features))

Output:

1

You might say that the function doesn't use a sliding window approach, but that is wrong: your example simply didn't need to be split:

print(len(examples[0].question_text.split()) + len(examples[0].doc_tokens))

Output:

115

That is less than the max_seq_length, which you set to 384. Now let's try a different example:

print(len(examples[129603].question_text.split()) + len(examples[129603].doc_tokens))

features = squad_convert_example_to_features(
    example=examples[129603],
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
)
print(len(features))

Output:

454
3

You can now compare the splits with the original example:

print('[CLS]' + examples[129603].question_text + '[SEP]' + ' '.join(examples[129603].doc_tokens) + '[SEP]')

for idx, f in enumerate(features):
    print('Split {}'.format(idx))
    print(' '.join(f.tokens))

Output:

[CLS]How often is hunting occurring in Delaware each year?[SEP]There is a very active tradition of hunting of small to medium-sized wild game in Trinidad and Tobago. Hunting is carried out with firearms, and aided by the use of hounds, with the illegal use of trap guns, trap cages and snare nets. With approximately 12,000 sport hunters applying for hunting licences in recent years (in a very small country of about the size of the state of Delaware at about 5128 square kilometers and 1.3 million inhabitants), there is some concern that the practice might not be sustainable. In addition there are at present no bag limits and the open season is comparatively very long (5 months - October to February inclusive). As such hunting pressure from legal hunters is very high. Added to that, there is a thriving and very lucrative black market for poached wild game (sold and enthusiastically purchased as expensive luxury delicacies) and the numbers of commercial poachers in operation is unknown but presumed to be fairly high. As a result, the populations of the five major mammalian game species (red-rumped agouti, lowland paca, nine-banded armadillo, collared peccary, and red brocket deer) are thought to be quite low (although scientifically conducted population studies are only just recently being conducted as of 2013). It appears that the red brocket deer population has been extirpated on Tobago as a result of over-hunting. Various herons, ducks, doves, the green iguana, the gold tegu, the spectacled caiman and the common opossum are also commonly hunted and poached. There is also some poaching of 'fully protected species', including red howler monkeys and capuchin monkeys, southern tamanduas, Brazilian porcupines, yellow-footed tortoises, Trinidad piping guans and even one of the national birds, the scarlet ibis. Legal hunters pay very small fees to obtain hunting licences and undergo no official basic conservation biology or hunting-ethics training. 
There is presumed to be relatively very little subsistence hunting in the country (with most hunting for either sport or commercial profit). The local wildlife management authority is under-staffed and under-funded, and as such very little in the way of enforcement is done to uphold existing wildlife management laws, with hunting occurring both in and out of season, and even in wildlife sanctuaries. There is some indication that the government is beginning to take the issue of wildlife management more seriously, with well drafted legislation being brought before Parliament in 2015. It remains to be seen if the drafted legislation will be fully adopted and financially supported by the current and future governments, and if the general populace will move towards a greater awareness of the importance of wildlife conservation and change the culture of wanton consumption to one of sustainable management.[SEP]
Split 0
[CLS] how often is hunting occurring in delaware each year ? [SEP] there is a very active tradition of hunting of small to medium - sized wild game in trinidad and tobago . hunting is carried out with firearms , and aided by the use of hounds , with the illegal use of trap guns , trap cages and s ##nare nets . with approximately 12 , 000 sport hunters applying for hunting licence ##s in recent years ( in a very small country of about the size of the state of delaware at about 512 ##8 square kilometers and 1 . 3 million inhabitants ) , there is some concern that the practice might not be sustainable . in addition there are at present no bag limits and the open season is comparatively very long ( 5 months - october to february inclusive ) . as such hunting pressure from legal hunters is very high . added to that , there is a thriving and very lucrative black market for po ##ache ##d wild game ( sold and enthusiastically purchased as expensive luxury del ##ica ##cies ) and the numbers of commercial po ##ache ##rs in operation is unknown but presumed to be fairly high . as a result , the populations of the five major mammalian game species ( red - rum ##ped ago ##uti , lowland pac ##a , nine - banded arm ##adi ##llo , collar ##ed pe ##cca ##ry , and red brock ##et deer ) are thought to be quite low ( although scientific ##ally conducted population studies are only just recently being conducted as of 2013 ) . it appears that the red brock ##et deer population has been ex ##ti ##rp ##ated on tobago as a result of over - hunting . various heron ##s , ducks , dove ##s , the green i ##gua ##na , the gold te ##gu , the spectacle ##d cai ##man and the common op ##oss ##um are also commonly hunted and po ##ache ##d . there is also some po ##achi ##ng of ' fully protected species ' , including red howl ##er monkeys and cap ##uchi ##n monkeys , southern tam ##and ##ua ##s , brazilian por ##cup ##ines , yellow - footed tor ##to ##ises , [SEP]
Split 1
[CLS] how often is hunting occurring in delaware each year ? [SEP] october to february inclusive ) . as such hunting pressure from legal hunters is very high . added to that , there is a thriving and very lucrative black market for po ##ache ##d wild game ( sold and enthusiastically purchased as expensive luxury del ##ica ##cies ) and the numbers of commercial po ##ache ##rs in operation is unknown but presumed to be fairly high . as a result , the populations of the five major mammalian game species ( red - rum ##ped ago ##uti , lowland pac ##a , nine - banded arm ##adi ##llo , collar ##ed pe ##cca ##ry , and red brock ##et deer ) are thought to be quite low ( although scientific ##ally conducted population studies are only just recently being conducted as of 2013 ) . it appears that the red brock ##et deer population has been ex ##ti ##rp ##ated on tobago as a result of over - hunting . various heron ##s , ducks , dove ##s , the green i ##gua ##na , the gold te ##gu , the spectacle ##d cai ##man and the common op ##oss ##um are also commonly hunted and po ##ache ##d . there is also some po ##achi ##ng of ' fully protected species ' , including red howl ##er monkeys and cap ##uchi ##n monkeys , southern tam ##and ##ua ##s , brazilian por ##cup ##ines , yellow - footed tor ##to ##ises , trinidad pip ##ing gu ##ans and even one of the national birds , the scarlet ib ##is . legal hunters pay very small fees to obtain hunting licence ##s and undergo no official basic conservation biology or hunting - ethics training . there is presumed to be relatively very little subsistence hunting in the country ( with most hunting for either sport or commercial profit ) . the local wildlife management authority is under - staffed and under - funded , and as such very little in the way of enforcement is done to uphold existing wildlife management laws , with hunting occurring both in and out of season , and even in wildlife san ##ct ##uaries . 
there is some indication that the government is beginning to [SEP]
Split 2
[CLS] how often is hunting occurring in delaware each year ? [SEP] being conducted as of 2013 ) . it appears that the red brock ##et deer population has been ex ##ti ##rp ##ated on tobago as a result of over - hunting . various heron ##s , ducks , dove ##s , the green i ##gua ##na , the gold te ##gu , the spectacle ##d cai ##man and the common op ##oss ##um are also commonly hunted and po ##ache ##d . there is also some po ##achi ##ng of ' fully protected species ' , including red howl ##er monkeys and cap ##uchi ##n monkeys , southern tam ##and ##ua ##s , brazilian por ##cup ##ines , yellow - footed tor ##to ##ises , trinidad pip ##ing gu ##ans and even one of the national birds , the scarlet ib ##is . legal hunters pay very small fees to obtain hunting licence ##s and undergo no official basic conservation biology or hunting - ethics training . there is presumed to be relatively very little subsistence hunting in the country ( with most hunting for either sport or commercial profit ) . the local wildlife management authority is under - staffed and under - funded , and as such very little in the way of enforcement is done to uphold existing wildlife management laws , with hunting occurring both in and out of season , and even in wildlife san ##ct ##uaries . there is some indication that the government is beginning to take the issue of wildlife management more seriously , with well drafted legislation being brought before parliament in 2015 . it remains to be seen if the drafted legislation will be fully adopted and financially supported by the current and future governments , and if the general populace will move towards a greater awareness of the importance of wildlife conservation and change the culture of want ##on consumption to one of sustainable management . [SEP]

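"If my questions were 'when did Kaggle make the announcement' and 'how many registered users', I can use chunk 1 and chunk 3 and not use chunk 2 at all in the model. I am not sure whether I should still use chunk 2 to train the model."

Yes, you should also use chunk 2 to train your model, because when you later predict on the same sequence you want the model to predict 0:0 as the answer span for chunk 2 (that way you can easily select the chunk that contains the answer).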
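The 0:0 convention can be illustrated with a small word-level sketch. It assumes the input layout [CLS] question [SEP] chunk [SEP], so that position 0 is [CLS], and it assumes a chunk is labelled with the answer only when it contains the answer span completely. The helper label_chunk and its parameters are hypothetical names for illustration and are not the library's own code.

```python
def label_chunk(chunk_start, chunk_len, answer_start, answer_end, num_query_tokens):
    """Return (start, end) target positions for one chunk.

    chunk_start / chunk_len: where the chunk sits in the full context (word indices).
    answer_start / answer_end: inclusive word indices of the answer in the full context.
    Returns (0, 0), i.e. the [CLS] position, when the chunk does not fully contain
    the answer.
    """
    chunk_end = chunk_start + chunk_len - 1
    if answer_start < chunk_start or answer_end > chunk_end:
        return (0, 0)
    # Positions shift by [CLS] + question tokens + [SEP].
    offset = num_query_tokens + 2
    return (answer_start - chunk_start + offset, answer_end - chunk_start + offset)


context = "In June 2017 Kaggle announced that it passed 1 million registered users".split()
question = "how many registered users".split()
answer_start, answer_end = 8, 9  # the words "1 million"

# (start index, length) of the three chunks from the question.
chunk_spans = [(0, 6), (4, 6), (8, 4)]
labels = [
    label_chunk(start, length, answer_start, answer_end, len(question))
    for start, length in chunk_spans
]
print(labels)
```

In this sketch chunk 1 receives (0, 0), while chunk 2 and chunk 3 both receive a real span because "1 million" falls inside the overlap. An unanswerable question such as "can pigs fly" would receive (0, 0) for every chunk rather than -1.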
