SentencePiece, subword-nmt, and the BPE algorithm

2023-11-04

BPE (Byte Pair Encoding). It was applied to machine translation in 2016 to handle out-of-vocabulary (OOV) words and rare words. The paper is "Neural Machine Translation of Rare Words with Subword Units", published at ACL 2016.

http://www.sohu.com/a/115373230_465975

tensor2tensor uses BPE; the relevant files are:

data_generators/problem.py

data_generators/translate_ende.py

BPE implementations:

1. Reference: https://plmsmile.github.io/2017/10/19/subword-units/

import re
def process_raw_words(words, endtag='-'):
    '''Split each word into its smallest symbols and append an end tag.'''
    vocabs = {}
    for word, count in words.items():
        # Insert a space before every letter
        word = re.sub(r'([a-zA-Z])', r' \1', word)
        word += ' ' + endtag
        vocabs[word] = count
    return vocabs

def get_symbol_pairs(vocabs):
    '''Collect every pair of adjacent symbols in the vocabulary and count its occurrences.
    Args:
        vocabs: dict of (word, count); each word is already split into its smallest symbols
    Returns:
        pairs: dict of ((symbol1, symbol2), count)
    '''
    pairs = dict()
    for word, freq in vocabs.items():
        # Symbols that make up the word
        symbols = word.split()
        for i in range(len(symbols) - 1):
            p = (symbols[i], symbols[i + 1])
            pairs[p] = pairs.get(p, 0) + freq
    return pairs

def merge_symbols(symbol_pair, vocabs):
    '''Replace the string 'a b' with 'ab' in every word of vocabs.
    Args:
        symbol_pair: (a, b), the two symbols to merge
        vocabs: dict of (word, count); each word is a space-separated sequence of subwords (symbols)
    Returns:
        vocabs_new: new vocabulary with 'a b' replaced by 'ab'
    '''
    vocabs_new = {}
    raw = ' '.join(symbol_pair)
    merged = ''.join(symbol_pair)
    # Escape characters that are special in a regular expression
    bigram = re.escape(raw)
    # Match the bigram only at symbol boundaries
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, count in vocabs.items():
        word_new = p.sub(merged, word)
        vocabs_new[word_new] = count
    return vocabs_new

raw_words = {"low":5, "lower":2, "newest":6, "widest":3}
vocabs = process_raw_words(raw_words)

num_merges = 10
print(vocabs)
for i in range(num_merges):
    pairs = get_symbol_pairs(vocabs)
    # Pick the most frequent pair
    symbol_pair = max(pairs, key=pairs.get)
    vocabs = merge_symbols(symbol_pair, vocabs)
print(vocabs)

Output:

Before: {"low":5, "lower":2, "newest":6, "widest":3}
After BPE: {' low-': 5, ' low e r -': 2, ' newest-': 6, ' wi d est-': 3}

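{"low":5, "lower":2, "newest":6, "widest":3} gives the original frequency of each word. Splitting the final output on spaces yields the modeling units; here they are low-, low, e, r, -, newest-, wi, d and est- (the trailing - is the end tag). Any output text can then be mapped onto these units and written as a sequence of them.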
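To show how the learned units are used on a word that was not in the training vocabulary, here is a minimal sketch. It assumes the merges are replayed in the order they were learned, which is the usual way BPE is applied. The helper names `learn_merges` and `segment` are illustrative and do not come from the referenced post.

```python
import re

def learn_merges(vocab, num_merges):
    """Learn BPE merges from a {space-separated symbols: count} dict."""
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = {}
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] = pairs.get((a, b), 0) + freq
        if not pairs:
            break
        # Ties are broken by first occurrence, as in the code above
        best = max(pairs, key=pairs.get)
        merges.append(best)
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)')
        vocab = {pattern.sub(''.join(best), w): c for w, c in vocab.items()}
    return merges

def segment(word, merges, endtag='-'):
    """Split a word into subword units by replaying merges in learned order."""
    symbols = list(word) + [endtag]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # merge in place, re-check same index
            else:
                i += 1
    return symbols

vocab = {'l o w -': 5, 'l o w e r -': 2, 'n e w e s t -': 6, 'w i d e s t -': 3}
merges = learn_merges(vocab, 10)
print(segment('lowest', merges))   # ['low', 'est-']
```

"lowest" never appears in the training words, yet it is covered by the two units low and est-, which is the point of subword units for OOV words.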


2. Reference: "Neural Machine Translation of Rare Words with Subword Units"

Paper walkthrough: http://www.sohu.com/a/115373230_465975

import re, collections

def get_stats(vocab):
    # Count how often each pair of adjacent symbols occurs, weighted by word frequency
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        print(symbols)
        print("len(symbols)     ---   ", len(symbols))
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Replace every occurrence of the pair 'a b' with the merged symbol 'ab'
    v_out = {}
    bigram = re.escape(' '.join(pair))
    print("bigram    ", bigram)
    # Match the bigram only at symbol boundaries
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        print("w_out    ", w_out)
        v_out[w_out] = v_in[word]
    return v_out
     
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 10

for i in range(num_merges):
    print("=#####################################=== ")
    pairs = get_stats(vocab)
    print("===========11111======= ")
    print(pairs)

    # Pick the most frequent pair
    best = max(pairs, key=pairs.get)
    print("===========2222======= ")
    print("best   ", best)
    vocab = merge_vocab(best, vocab)
    print("vocab   ", vocab)
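The `vocab` dict in the snippet above is written out by hand. As a rough sketch of how such a dict might be built from raw text, the hypothetical helper `build_vocab` below counts whitespace-separated words and spells each one as space-separated characters followed by an end-of-word marker. The marker `</w>` follows the paper's convention and is passed as a parameter; the toy corpus is made up for illustration.

```python
import collections

def build_vocab(lines, end_symbol='</w>'):
    """Count words, then spell each as space-separated characters plus an end marker."""
    counts = collections.Counter()
    for line in lines:
        counts.update(line.split())
    return {' '.join(word) + ' ' + end_symbol: freq for word, freq in counts.items()}

# Toy corpus chosen so the counts match the example dict
corpus = [
    'low low low low low',
    'lower lower',
    'newest newest newest newest newest newest',
    'widest widest widest',
]
vocab = build_vocab(corpus)
print(vocab)
```

The result can be passed straight to `get_stats` and `merge_vocab`.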

Personally, I still find SentencePiece the most convenient tool for subword tokenization.

SentencePiece

Reference: https://github.com/google/sentencepiece/tree/master/python

Tokenize into 20k label ids:

>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=/data/yelong/bpe_test/lib.txt --model_prefix=/data/yelong/bpe_test/bpe --vocab_size=20000 --model_type=bpe') 

import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.Load("/data/yelong/bpe_test/bpe.model")
with open('/data/yelong/bpe_test/wav/train/text.txt', 'a') as fid, open('/data/yelong/bpe_test/wav/train/train.txt') as did:
    for line in did:
        fields = line.strip().split()
        key = ''.join(fields[:1])        # first field, kept as the line's key
        text = ' '.join(fields[1:])      # e.g. "TWO COME MUSE MIGRATE"
        listid = sp.EncodeAsIds(text)
        strid = ' '.join(str(t) for t in listid)
        fid.write(key + ' ' + strid + '\n')
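The per-line conversion in the loop above can be isolated as a small function, which makes the output format easier to see and to test without a trained model. This is only a sketch: `convert_line`, the stand-in encoder `fake_encode` and the key `utt001` are hypothetical, and a real run would pass `sp.EncodeAsIds` instead. It assumes every line has at least one field.

```python
def convert_line(line, encode):
    """Turn 'key WORD WORD ...' into 'key id id ...' using the given encoder."""
    fields = line.split()
    key, words = fields[0], fields[1:]
    ids = encode(' '.join(words))
    return key + ' ' + ' '.join(str(i) for i in ids)

def fake_encode(text):
    """Stand-in encoder for illustration only: one fake id per word (its length)."""
    return [len(w) for w in text.split()]

print(convert_line('utt001 TWO COME MUSE MIGRATE\n', fake_encode))   # utt001 3 4 4 7
```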

Training produces two files, .model and .vocab.

bpe.vocab:

<unk>   0
<s>     0
</s>    0
▁T      -0
HE      -1
▁A      -2
▁THE    -3
IN      -4
▁S      -5
▁W      -6

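The file is a mapping from piece to score; the right-hand column is not the id. model_type has several options (unigram (default), bpe, char, or word), and with unigram, for example, the right-hand column holds fractional values, so it cannot be the id.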

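So in the nabu config I should not have set the alphabet to only 0-19996 (the last entry in bpe.vocab is 19996). The three special tokens come on top of those pieces, so the alphabet should be 0-19999.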
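A small sketch of the counting argument, assuming that a piece's id is its 0-based line position in the tab-separated .vocab file, which agrees with the 0-19999 check in this post. The helper `read_vocab` is illustrative, and the sample text reuses the first lines of bpe.vocab shown above.

```python
def read_vocab(text):
    """Map each piece to its id, taken as the 0-based line position."""
    piece_to_id = {}
    for idx, line in enumerate(text.splitlines()):
        piece, _score = line.split('\t')   # the score column is ignored
        piece_to_id[piece] = idx
    return piece_to_id

sample = "<unk>\t0\n<s>\t0\n</s>\t0\n▁T\t-0\nHE\t-1\n▁A\t-2\n"
ids = read_vocab(sample)
print(ids['▁T'], ids['HE'])   # 3 4

# Three special tokens plus BPE pieces scored -0 .. -19996
num_special = 3
num_bpe_pieces = 19996 + 1
print(num_special + num_bpe_pieces)   # 20000
```

The score of a piece and its id differ by the number of special tokens in front of it, which is why the last score is 19996 while the last id is 19999.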

I verified that every id in 0-19999 maps to a piece, as follows:

% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("/data/yelong/bpe_test/bpe.model")
>>> for i in range(20000):
...     sp.IdToPiece(i)

Every id prints a piece. (An id without a piece would raise an error and exit.)
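The same check can be wrapped in a function that also compares the vocabulary size reported by the model. This is a sketch only: `check_id_range` is a hypothetical helper, and `FakeProcessor` is a stand-in so the example runs without a model file. With a real model, the loaded `SentencePieceProcessor` would be passed instead, relying on its `GetPieceSize` and `IdToPiece` methods.

```python
def check_id_range(sp, expected_size):
    """Return True if the model has exactly expected_size ids and each maps to a piece."""
    if sp.GetPieceSize() != expected_size:
        return False
    for i in range(expected_size):
        sp.IdToPiece(i)   # expected to raise for an id without a piece
    return True

class FakeProcessor:
    """Stand-in for spm.SentencePieceProcessor, for illustration only."""
    def __init__(self, pieces):
        self.pieces = pieces
    def GetPieceSize(self):
        return len(self.pieces)
    def IdToPiece(self, i):
        return self.pieces[i]

fake = FakeProcessor(['<unk>', '<s>', '</s>', '▁T', 'HE'])
print(check_id_range(fake, 5))   # True
print(check_id_range(fake, 4))   # False
```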
