BERT, Part 3: The Tokenizer

2023-11-03


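What a tokenizer is

A tokenizer is a word segmenter. In BERT, though, it is not quite what we usually mean by Chinese word segmentation. The difference is not mainly in the segmentation algorithm: BERT essentially uses greedy longest-match everywhere.

The biggest difference lies in how a "word" is understood and defined. Chinese, for example, is basically handled one character at a time.
English uses subwords instead: "unwanted" is split into ["un", "##want", "##ed"]. Think carefully about the advantages of doing it this way.
This is one of the core ideas behind the tokenizer.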
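One way to see the advantage is simply to count: a handful of pieces can spell many word forms. The snippet below is a toy sketch. Only "un", "##want" and "##ed" come from the example above; the other pieces are made up for illustration, and some of the combinations are not real English words.

```python
from itertools import product

# Toy piece inventory (assumed for illustration, not BERT's vocab).
prefixes = ["un", "re"]
stems = ["##want", "##load", "##pack"]
suffixes = ["##ed", "##ing"]


def join_pieces(pieces):
    """Glue wordpieces back into a surface string by dropping the '##' marker."""
    return "".join(p[2:] if p.startswith("##") else p for p in pieces)


# Every prefix + stem + suffix combination that the pieces can spell.
words = [join_pieces(combo) for combo in product(prefixes, stems, suffixes)]
vocab_size = len(prefixes) + len(stems) + len(suffixes)

print(vocab_size, "pieces spell", len(words), "word forms")
print(words)
```

Seven vocabulary entries cover twelve surface forms here, and a combination never seen during training can still be spelled from known pieces instead of collapsing into an unknown token.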

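Tokenizers involved in BERT

BasicTokenizer

The main class is BasicTokenizer. It handles the basics: Unicode conversion, text cleanup, Chinese character splitting, lowercasing, accent stripping, and punctuation splitting. It returns a list of words (for Chinese, a list of characters).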

  def tokenize(self, text):
    """Tokenizes a piece of text."""
    text = convert_to_unicode(text)
    text = self._clean_text(text)

    # This was added on November 1st, 2018 for the multilingual and Chinese
    # models. This is also applied to the English models now, but it doesn't
    # matter since the English models were not trained on any Chinese data
    # and generally don't have any Chinese data in them (there are Chinese
    # characters in the vocabulary because Wikipedia does have some Chinese
    # words in the English Wikipedia.).
    text = self._tokenize_chinese_chars(text)

    orig_tokens = whitespace_tokenize(text)
    split_tokens = []
    for token in orig_tokens:
      if self.do_lower_case:
        token = token.lower()
        token = self._run_strip_accents(token)
      split_tokens.extend(self._run_split_on_punc(token))

    output_tokens = whitespace_tokenize(" ".join(split_tokens))
    return output_tokens

BasicTokenizer is the preprocessing step.
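The steps in that method can be approximated with the standard library alone. This is a simplified sketch, not the BERT source: the helper names are made up, `is_cjk` checks only the main CJK Unified Ideographs block (BERT checks several more ranges), and the `_clean_text` step is skipped.

```python
import unicodedata


def is_cjk(ch):
    # Simplification: only the main CJK Unified Ideographs block.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def is_punct(ch):
    cp = ord(ch)
    # Like BERT, treat every non-alphanumeric printable ASCII char as punctuation.
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def strip_accents(text):
    # Decompose characters, then drop the combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def basic_tokenize(text, do_lower_case=True):
    # Surround each CJK character with spaces so it becomes its own token.
    spaced = "".join(f" {c} " if is_cjk(c) else c for c in text)
    tokens = []
    for word in spaced.split():
        if do_lower_case:
            word = strip_accents(word.lower())
        current = ""
        for ch in word:
            if is_punct(ch):
                # Punctuation ends the current token and stands alone.
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


print(basic_tokenize("Héllo, World!"))
print(basic_tokenize("我爱北京天安门，吢吣"))
```

Note that nothing here consults a vocabulary: this stage only normalizes and splits, and leaves the decision about known and unknown pieces to the next stage.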

WordpieceTokenizer

The other class, and the key one, is WordpieceTokenizer, which splits each word into pieces by looking them up in the vocab.

  def tokenize(self, text):
    """Tokenizes a piece of text into its word pieces.

    This uses a greedy longest-match-first algorithm to perform tokenization
    using the given vocabulary.

    For example:
      input = "unaffable"
      output = ["un", "##aff", "##able"]

    Args:
      text: A single token or whitespace separated tokens. This should have
        already been passed through `BasicTokenizer`.

    Returns:
      A list of wordpiece tokens.
    """

    text = convert_to_unicode(text)

    output_tokens = []
    for token in whitespace_tokenize(text):
      chars = list(token)
      if len(chars) > self.max_input_chars_per_word:
        output_tokens.append(self.unk_token)
        continue

      is_bad = False
      start = 0
      sub_tokens = []
      while start < len(chars):
        end = len(chars)
        cur_substr = None

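        # Try the longest substring starting at `start`; if it is not in the
        # vocab, move `end` one character to the left and try again.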
        while start < end:
          substr = "".join(chars[start:end])
          if start > 0:
            substr = "##" + substr
          if substr in self.vocab:
            cur_substr = substr
            break
          end -= 1
        if cur_substr is None:
          is_bad = True
          break
        sub_tokens.append(cur_substr)
        start = end

      if is_bad:
        output_tokens.append(self.unk_token)
      else:
        output_tokens.extend(sub_tokens)
    return output_tokens
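To try the loop without loading a model, here is a standalone sketch of the same greedy longest-match-first procedure. The function name `wordpiece` and both toy vocabularies are assumptions for illustration; they are not BERT's real vocab.

```python
def wordpiece(token, vocab, unk_token="[UNK]", max_chars=100):
    """Greedy longest-match-first split of a single token into wordpieces."""
    if len(token) > max_chars:
        return [unk_token]
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # One unmatched stretch makes the whole token unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies, made up for this example.
small_vocab = {"un", "##aff", "##able", "##want", "##ed"}
larger_vocab = small_vocab | {"una", "##ffa", "##ble"}

print(wordpiece("unaffable", small_vocab))
print(wordpiece("unaffable", larger_vocab))
print(wordpiece("unwanted", small_vocab))
print(wordpiece("gpu", small_vocab))
```

With only "un" available, the split matches the docstring example. Once the longer prefix "una" is in the vocab, greedy matching takes it first, which changes every later piece. A token with any unmatched remainder becomes a single [UNK].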

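FullTokenizer

This one simply chains BasicTokenizer and WordpieceTokenizer. It is used for preprocessing in BERT training. It is essentially just a tokenize method; there is no encode_plus or anything similar.

PreTrainedTokenizer

This is the base class behind BERT's tokenizer in transformers. It defines many methods (convert_ids_to_tokens and so on). BertTokenizer and GPT2Tokenizer both inherit from PreTrainedTokenizer; the diagram below gives the full picture.

Class diagram

(Figure: class hierarchy of the tokenizers; image not included.)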

Hands-on

from transformers import BertTokenizer


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print("Vocab size:", tokenizer.vocab_size)
text = "the game has gone!unaffable  I have a new GPU!"
tokens = tokenizer.tokenize(text)
print("English tokenization:", tokens)


text = "我爱北京天安门，吢吣"
tokens = tokenizer.tokenize(text)
print("Chinese tokenization:", tokens)

input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token-to-id conversion:", input_ids)


sen_code = tokenizer.encode_plus("i like  you  much", "but not him")
print("Sentence-pair encode:", sen_code)

print("decode:", tokenizer.decode(sen_code['input_ids']))

Output:

Vocab size: 30522
English tokenization: ['the', 'game', 'has', 'gone', '!', 'una', '##ffa', '##ble', 'i', 'have', 'a', 'new', 'gp', '##u', '!']
Chinese tokenization: ['我', '[UNK]', '北', '京', '天', '安', '[UNK]', '，', '[UNK]', '[UNK]']
Token-to-id conversion: [1855, 100, 1781, 1755, 1811, 1820, 100, 1989, 100, 100]
Sentence-pair encode: {'input_ids': [101, 1045, 2066, 2017, 2172, 102, 2021, 2025, 2032, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
decode: [CLS] i like you much [SEP] but not him [SEP]
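The sentence-pair result above follows BERT's layout [CLS] A [SEP] B [SEP], with segment 0 for the first sentence (including [CLS] and its [SEP]) and segment 1 for the second. Here is a minimal sketch that rebuilds the same dictionary by hand, assuming the ids shown in the output (101 for [CLS], 102 for [SEP]). `build_pair_inputs` is a hypothetical helper, not a transformers API.

```python
# Special-token ids as they appear in the output above.
CLS_ID, SEP_ID = 101, 102


def build_pair_inputs(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] with segment ids and an attention mask."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS] + A + [SEP]; segment 1 covers B + [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # No padding here, so every position is attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


ids_a = [1045, 2066, 2017, 2172]  # i like you much
ids_b = [2021, 2025, 2032]        # but not him
encoded = build_pair_inputs(ids_a, ids_b)
print(encoded)
```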

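It works better to read the code or run it once yourself and only then go back to the theory. Hands-on practice is the key; it is where the ideas become concrete.

You can of course also experiment with BertWordPieceTokenizer on its own.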

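from tokenizers import BertWordPieceTokenizer  # lives in the tokenizers package, not transformers
# initialize tokenizer
tokenizer = BertWordPieceTokenizer(
    "vocab.txt",  # vocab path, passed positionally because the keyword name differs across tokenizers releases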
    unk_token = "[UNK]",
    sep_token = "[SEP]",
    cls_token = "[CLS]",
    pad_token  = "[PAD]",
    mask_token = "[MASK]",
    clean_text = True,
    handle_chinese_chars = True,
    strip_accents= True,
    lowercase = True,
    wordpieces_prefix = "##"
)


# sample sentence
sentence = "Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect."

# tokenize the sample sentence
encoded_output = tokenizer.encode(sentence)
print(encoded_output)
print(encoded_output.tokens)

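How training works

Training is really just the process of extracting the vocab.
The BPE algorithm is fairly easy to understand: keep merging the most frequent adjacent pair of symbols and add the result to the vocabulary. Why the most frequent? Because it covers the largest share of the corpus.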

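Here is a BPE example.

Initial word counts:
('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5)

Split every word into characters:
('h' 'u' 'g', 10), ('p' 'u' 'g', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'u' 'g' 's', 5)

Merge the most frequent pair, 'u' + 'g' (10 + 5 + 5 = 20 occurrences), into 'ug':

('h' 'ug', 10), ('p' 'ug', 5), ('p' 'u' 'n', 12), ('b' 'u' 'n', 4), ('h' 'ug' 's', 5)

The next most frequent pair is 'u' + 'n' (12 + 4 = 16), which is merged into 'un':

('h' 'ug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('h' 'ug' 's', 5)

Then 'h' + 'ug' (10 + 5 = 15) is merged into 'hug'. The vocabulary is now:
 ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']

Finally, the original words are represented as:

('hug', 10), ('p' 'ug', 5), ('p' 'un', 12), ('b' 'un', 4), ('hug' 's', 5)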
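The merge loop can be sketched in a few lines of Python. This is a toy trainer for the five-word example above, not the implementation used by any library; `count_pairs`, `merge_pair` and `train_bpe` are names made up here.

```python
from collections import Counter


def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts, num_merges):
    # Start from single characters, then apply the most frequent merge each round.
    corpus = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus


word_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, corpus = train_bpe(word_counts, num_merges=3)
print(merges)
print(corpus)
```

Three merges reproduce the walk-through: 'ug', then 'un', then 'hug'. A real trainer stops when the vocabulary reaches a target size (the vocab_size argument) rather than after a fixed number of merges.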

Training your own Chinese tokenizer

def train_cn_tokenizer():
    # ! pip install tokenizers

    from pathlib import Path

    from tokenizers import ByteLevelBPETokenizer

    paths = [str(x) for x in Path("zho-cn_web_2015_10K").glob("**/*.txt")]

    # Initialize a tokenizer
    tokenizer = ByteLevelBPETokenizer()

    # Customize training
    tokenizer.train(files=paths, vocab_size=52_000, min_frequency=3, special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ])

    # Save files to disk
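    # save_model writes vocab.json and merges.txt with the given prefix
    # (older tokenizers releases used save(directory, name) for this)
    tokenizer.save_model(".", "zh-tokenizer-train")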

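I strongly recommend building a vocab tailored to your own business domain; the model, of course, has to be trained to match it.
The resulting vocab: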

{"<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,"!":5,"\"":6,"#":7,"$":8,"%":9,"&":10,"'":11,"(":12,")":13,"*":14,"+":15,",":16,"-":17,".":18,"/":19,"0":20,"1":21,"2":22,"3":23,"4":24,"5":25,"6":26,"7":27,"8":28,"9":29,":":30,";":31,"<":32,"=":33,">":34,"?":35,"@":36,"A":37,"B":38,"C":39,"D":40,"E":41,"F":42,"G":43,"H":44,"I":45,"J":46,"K":47,"L":48,"M":49,"N":50,"O":51,"P":52,"Q":53,"R":54,"S":55,"T":56,"U":57,"V":58,"W":59,"X":60,"Y":61,"Z":62,"[":63,"\\":64,"]":65,"^":66,"_":67,"`":68,"a":69,"b":70,"c":71,"d":72,"e":73,"f":74,"g":75,"h":76,"i":77,"j":78,"k":79,"l":80,"m":81,"n":82,"o":83,"p":84,"q":85,"r":86,"s":87,"t":88,"u":89,"v":90,"w":91,"x":92,"y":93,"z":94,"{":95,"|":96,"}":97,"~":98,"¡":99,"¢":100,"£":101,"¤":102,"¥":103,"¦":104,"§":105,"¨":106,"©":107,"ª":108,"«":109,"¬":110,"®":111,"¯":112,"°":113,"±":114,"²":115,"³":116,"´":117,"µ":118,"¶":119,"·":120,"¸":121,"¹":122,"º":123,"»":124,"¼":125,"½":126,"¾":127,"¿":128,"À":129,"Á":130,"Â":131,"Ã":132,"Ä":133,"Å":134,"Æ":135,
...
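The vocab opens with characters such as "¡" and "¢" rather than Chinese characters because a byte-level BPE works on UTF-8 bytes, and each byte is displayed as one printable character. The sketch below assumes the byte-to-character table from GPT-2's reference encoder, which ByteLevelBPETokenizer is assumed to follow; `to_byte_level` is a hypothetical helper.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    keep = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep:
            # Control, whitespace and similar bytes are moved to code points 256+.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}


TABLE = bytes_to_unicode()


def to_byte_level(text):
    """Show a string the way a byte-level vocab would display it."""
    return "".join(TABLE[b] for b in text.encode("utf-8"))


print(to_byte_level("我"))      # one Chinese character becomes three byte symbols
print(to_byte_level("a b"))    # the space byte is shown as a visible symbol
```

So a Chinese character never shows up directly in such a vocab: it appears as a run of three byte symbols, and frequent runs get merged into single tokens during training.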

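Summary

Combine theory with practice; type out the code yourself to understand it in depth.
At its core a tokenizer segments text: it extracts meaningful wordpieces while keeping their number as small as possible, so that a minimal set of units can describe unlimited combinations.
Get the inheritance relationships among the classes straight.
For the finer details, keep reading the original classes.
A wordpiece is a smaller unit than a word. What does that buy us? Does it solve OOV? This deserves more thought.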

