BERT Document Embeddings

2024-01-06

I am trying to use BERT for document embedding. The code I am using is a combination of two sources, one of which is the BERT word embeddings tutorial https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/. The code is below; I feed the first 510 tokens of each document into the BERT model. Finally, I apply K-means clustering to these embeddings, but the members of each cluster are completely unrelated. I am wondering how this can happen. Maybe something is wrong with my code. I would appreciate it if you could take a look at my code and tell me whether anything is wrong with it. I run this code on Google Colab.

# text_to_embedding function
import torch
from keras.preprocessing.sequence import pad_sequences

# Device used for the model and inputs below (a GPU runtime on Colab).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def text_to_embedding(tokenizer, model, in_text):
    '''
    Uses the provided BERT 'model' and 'tokenizer' to generate a vector
    representation of the input string, 'in_text'.

    Returns the vector stored as a numpy ndarray.
    '''

    # ===========================
    #   STEP 1: Tokenization
    # ===========================

    MAX_LEN = 510

    # 'encode' will:
    #  (1) Tokenize the sentence
    #  (2) Prepend the '[CLS]' token to the start.
    #  (3) Append the '[SEP]' token to the end.
    #  (4) Map tokens to their IDs.
    input_ids = tokenizer.encode(
        in_text,                         # sentence to encode.
        add_special_tokens = True,       # Add '[CLS]' and '[SEP]'
        max_length = MAX_LEN,            # Truncate all sentences.
        truncation = True,               # Explicitly enable truncation to MAX_LEN.
        #return_tensors = 'pt'           # Return pytorch tensors.
    )

    # Pad our input tokens. Truncation was handled above by the 'encode'
    # function, which also makes sure that the '[SEP]' token is placed at the
    # end *after* truncating.
    # Note: 'pad_sequences' expects a list of lists, but we only have one
    # piece of text, so we surround 'input_ids' with an extra set of brackets.
    results = pad_sequences([input_ids], maxlen=MAX_LEN, dtype="long",
                          value=0, truncating="post", padding="post")
    
    # Remove the outer list.
    input_ids = results[0]

    # Create attention masks.
    attn_mask = [int(i > 0) for i in input_ids]

    # Cast to tensors.
    input_ids = torch.tensor(input_ids)
    attn_mask = torch.tensor(attn_mask)

    # Add an extra dimension for the "batch" (even though there is only one
    # input in this batch)
    input_ids = input_ids.unsqueeze(0)
    attn_mask = attn_mask.unsqueeze(0)


    # =================================
    #   STEP 2: Run through the model
    # =================================

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Copy the inputs to the GPU
    input_ids = input_ids.to(device)
    attn_mask = attn_mask.to(device)

    # telling the model not to build the backward graph will make this
    # a little quicker.
    with torch.no_grad():

        # Forward pass, returns hidden states and predictions
        # This will return the logits rather than the loss because we have
        # not provided labels.
        outputs = model(
            input_ids = input_ids,
            token_type_ids = None,
            attention_mask = attn_mask)
        

        hidden_states = outputs[2]

        #Sentence Vectors
        #To get a single vector for our entire sentence we have multiple 
        #application-dependent strategies, but a simple approach is to 
        #average the second to last hiden layer of each token producing 
        #a single 768 length vector.
        # `hidden_states` has shape [13 x 1 x ? x 768]

        # `token_vecs` is a tensor with shape [? x 768]
        token_vecs = hidden_states[-2][0]

        # Calculate the average of all ? token vectors.
        sentence_embedding = torch.mean(token_vecs, dim=0)
        # Move to the CPU and convert to numpy ndarray.
        sentence_embedding = sentence_embedding.detach().cpu().numpy()

        return sentence_embedding


from transformers import BertTokenizer, BertModel

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
model.to(device)

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
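
The K-means step mentioned in the question is not part of the code above. As a point of reference, a minimal sketch of how the embeddings could be collected and clustered is shown below; the `docs` list, the cluster count, and scikit-learn's KMeans are assumptions for illustration, not code from the original post.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical corpus; replace with the real list of document strings.
docs = ["first document text ...", "second document text ..."]

# Build a [num_docs x 768] matrix of document embeddings.
doc_embeddings = np.vstack(
    [text_to_embedding(tokenizer, model, doc) for doc in docs]
)

# Cluster the embeddings; the cluster count here is arbitrary.
num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
labels = kmeans.fit_predict(doc_embeddings)
print(labels)

Here `labels[i]` is the cluster index assigned to `docs[i]`.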

I don't know whether it solves your problem, but here are my two cents:

  • You don't have to compute the attention mask and do the padding manually. Have a look at the documentation https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__. Just call the tokenizer itself:
results = tokenizer(in_text, max_length=MAX_LEN, truncation=True)
input_ids = results.input_ids
attn_mask = results.attention_mask
# Cast to tensors
...
  • Instead of averaging the second-to-last hidden layer, you could try the same with the last hidden layer; or you could use the vector representation of [CLS] from the last layer. A sketch combining both suggestions follows below this list.
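
A minimal sketch of what those two suggestions could look like, combined into one function, is below. This is an illustration rather than the answerer's exact code; it reuses the `model`, `tokenizer`, and `device` from the question, and the function name and the `pooling` argument are made up for the example.

def text_to_embedding_v2(tokenizer, model, in_text, pooling="cls"):
    # Let the tokenizer handle truncation, padding and the attention mask.
    encoded = tokenizer(
        in_text,
        max_length=512,          # 512 includes '[CLS]' and '[SEP]'
        truncation=True,
        padding="max_length",
        return_tensors="pt",     # return PyTorch tensors directly
    )
    encoded = {k: v.to(device) for k, v in encoded.items()}

    model.eval()
    with torch.no_grad():
        outputs = model(**encoded)
        last_hidden = outputs[0]     # last hidden layer, shape [1 x seq_len x 768]

        if pooling == "cls":
            # Vector of the '[CLS]' token from the last layer.
            embedding = last_hidden[0, 0]
        else:
            # Mean of the last layer over the real (non-padding) tokens.
            mask = encoded["attention_mask"][0].unsqueeze(-1)   # [seq_len x 1]
            embedding = (last_hidden[0] * mask).sum(dim=0) / mask.sum()

    return embedding.cpu().numpy()

Either variant returns a 768-dimensional numpy vector that can be fed to the same K-means step as before.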