Emoji表情符号用于文本情感分析-Improving sentiment analysis accuracy with emoji embedding

2023-11-19

Abstract:
Due to the diversity and variability of Chinese syntax and semantics, accurately identifying and distinguishing individual emotions from online texts is challenging. To overcome this limitation, we incorporate a new source of individual sentiment, emojis, which contain thousands of graphic symbols and are increasingly being used for expressing emotion in online conversations. We examined popular sentiment analysis algorithms, including rule-based and classification algorithms, to evaluate the impact of supplementing emojis as additional features to improve the algorithm performance. Emojis were also translated into corresponding sentiment words when constructing features for comparison with those directly generated from emoji label words. In addition, considering different functions of emojis in texts, we classified all posts in the dataset by their emoji usage and examined the changes in algorithm performance. We found that emojis are effective as expanding features for improving the accuracy of sentiment analysis algorithms, and the algorithm performance can be further increased by taking different emoji usages into consideration. In this study, we developed an improved emoji-embedding model based on Bi-LSTM (namely, CEmo-LSTM), which achieves the highest accuracy (around 0.95) when analyzing online Chinese texts. We applied the CEmo-LSTM algorithm to a large dataset collected from Weibo from December 1, 2019 to March 20, 2020 to understand the sentiment evolution of online users during the COVID-19 pandemic. We found that the pandemic remarkably impacted individual sentiments and caused more passive emotions (e.g., horror and sadness). Our novel emoji-embedding algorithm creatively combined emojis as well as emoji usage with the sentiment analysis model and can handle emotion mining tasks more effectively and efficiently.
由于汉语句法和语义的多样性和可变性,准确识别和区分网络文本中的个人情感是一项挑战。为了克服这一限制,我们加入了一种新的个人情感来源,表情符号,它包含数千个图形符号,越来越多地被用于在线对话中表达情感。我们研究了流行的情感分析算法,包括基于规则的算法和分类算法,以评估补充表情符号作为额外特征对提高算法性能的影响。在构造特征以与表情符号标签词直接生成的特征进行比较时,表情符号也被翻译成相应的情感词。此外,考虑到表情符号在文本中的不同功能,我们根据表情符号的使用情况对数据集中的所有帖子进行分类,并检查算法性能的变化。我们发现表情符号作为扩展特征对于提高情感分析算法的准确性是有效的,并且通过考虑不同表情符号的使用可以进一步提高算法的性能。在本研究中,我们开发了一种基于Bi-LSTM(即CEmo-LSTM)的改进表情嵌入模型,该模型在分析在线中文文本时达到了最高的准确率(约0.95)。我们将CEmo LSTM算法应用于2019年12月1日至2020年3月20日从微博收集的一个大型数据集,以了解2019冠状病毒疾病大流行期间在线用户的情绪演变。我们发现,这种流行病显著影响了个人情绪,并导致更多的消极情绪(例如,恐惧和悲伤)。我们的新表情嵌入算法创造性地将表情以及表情的使用与情感分析模型相结合,可以更有效地处理情感挖掘任务。

Main Work:
However, these studies mainly considered emojis as one feature and did not research the sentiment effects of emojis on the whole texts. Little attention has been given to the SA model combined with different emoji usages in texts.
In this study, we proposed an emoji-embedding architecture named CEmo-LSTM to improve the accuracy of sentiment identification and classification in SA tasks. We further evaluated the benefits of introducing emojis to the accuracy of SA in both the traditional rule-based and supervised learning algorithms. Additionally, the most effective approach for embedding emojis in SA algorithms was examined. We compared the performance of the CEmo-LSTM model with that of other mainstream SA models in different experimental settings. Finally, by collecting all posts and embedded emojis published by users on Weibo during the COVID-19 outbreak, we utilized CEmo-LSTM to analyze the sentiment evolution of online users and measured the impact of the COVID-19 pandemic on individual moods. To the best of our knowledge, this is the first study that comprehensively evaluates the effectiveness of introducing emoji usage into SA algorithms.
然而,这些研究主要将表情符号作为一个特征,而没有研究表情符号对整个文本的情感影响。很少有人关注SA模型与文本中不同表情符号的结合。
在本研究中,我们提出了一种表情符号嵌入架构CEmo-LSTM,以提高SA任务中情感识别和分类的准确性。我们进一步评估了在传统的基于规则和监督学习算法中引入表情符号对SA准确性的好处。此外,还研究了在SA算法中嵌入表情符号的最有效的ap方法。我们比较了CEmo-LSTM模型与其他主流SA模型在不同实验环境下的性能。最后,通过收集2019冠状病毒疾病爆发期间用户在微博上发布的所有帖子和嵌入表情,我们利用CEmo-LSTM分析了在线用户的情绪演变,并衡量了2019冠状病毒疾病疫情对个人情绪的影响。据我们所知,这是第一次全面评估将表情符号使用引入SA算法的有效性的研究。

Research Process:

  1. Data collection: We collected all data from Weibo that were posted publicly by users located in Wuhan (the capital of the Hubei province in China), including microblog text, posting time, author ID, and gender, from December 1, 2019 to March 20, 2020. By comparing the sentiments in posts published by Wuhan users before and after the COVID-19 outbreak, we can analyze the sentiment evolution of online users and further explore the impact of COVID-19 on individual moods. Overall, 38,183,194 microblog posts from 2,239,472 unique users were collected. We found that emotion tokens (i.e., emoji characters) were commonly used in Weibo posts. There were 15,609,843 posts containing emoji symbols, accounting for 40.88% of the total posts. In addition, 1,279,828 users used emojis at least once, accounting for 57.15% of all unique users.
    数据收集:从2019年12月1日至2020年3月20日,我们从位于武汉(中国湖北省省会)的用户公开发布的微博上收集了所有数据,包括微博文本、发布时间、作者ID和性别。通过比较2019冠状病毒疾病爆发前后武汉用户发表的帖子中的情绪,我们可以分析网络用户的情绪演变,进一步探讨2019冠状病毒疾病对个人情绪的影响。总体而言,共收集了2239472名独立用户的38183194篇微博帖子。我们发现,情感标记(即表情符号)在微博帖子中普遍使用。共有15609843条含有表情符号的帖子,占帖子总数的40.88%。此外,1279828名用户至少使用过一次表情符号,占所有唯一用户的57.15%。
  2. Annotation: Although there have been some annotated corpora on Chinese and English for SA [23,24], they do not explicitly model the interaction between emojis and text. To fill in this gap, we manually annotated a Chinese microblog corpus. A total of 10 annotators (graduate students majoring in data analytics) were engaged to label the corpus, which consists of 10,000 randomly selected microblog posts. The sentiment polarities of the posts were manually classified as positive, negative, and neutral, denoted by 1, -1, and 0, respectively (Table 1). The annotators were asked to label each post by considering both the plain text and embedded emojis.
    As there are several principal functions for which emojis are used (e.g., sentiment expression, sentiment enhancement, and sentiment modification) [25], the emoji usage of each post containing emojis was also annotated. Specifically, the emoji usage of each post was classified into three categories, strengthening, reversing (or revising), and uncertain, labelled by 1, -1, and 0, respectively, indicating whether the sentiment of the embedded emojis was consistent (1) or inconsistent (-1) with the sentiment of the text-only post (Table 2). The label 0 was used to denote when the effect of emojis in the post could not be confidently determined. We found that most emojis embedded in the posts were used to strengthen and clarify the sentiment of the original texts, accounting for approximately 73.6% of all posts with emojis included in the corpus. Finally, all 10,000 microblog posts were labelled with their sentiment polarities, of which 5499 posts containing emojis were also annotated with their emoji usages.
    注释:尽管有一些关于SA的中英文注释语料库[23,24],但它们并没有明确地模拟表情符号和文本之间的交互。为了填补这一空白,我们手动注释了一个中文微博语料库。共有10名注释员(数据分析硕士研究生)参与了语料库的标注工作,语料库由10000条随机选择的微博帖子组成。这些帖子的情感极性被手动分为积极、消极和中性,分别用1、-1和0表示(表1)。注释者被要求通过考虑纯文本和嵌入表情来标记每篇文章。
    由于使用表情符号有几个主要功能(例如,情感表达、情感增强和情感修改)[25],因此还对每个包含表情符号的帖子的表情符号用法进行了注释。具体而言,每个帖子的表情符号用法分为三类,强化、反转(或修订)和不确定,分别用1、-1和0标记,表明嵌入表情符号的情绪与纯文本帖子的情绪是一致的(1)还是不一致的(-1)(表2)。标签0用于在无法确定帖子中表情符号的效果时进行注释。我们发现,大多数嵌入在帖子中的表情符号被用来加强和澄清原文的情感,约占语料库中包含表情符号的所有帖子的73.6%。最后,所有10000条微博帖子都贴上了情感极性标签,其中5499条包含表情符号的帖子也标注了表情符号的用法。
  3. CEmo-LSTM model:
    The architecture of the CEmo-LSTM model.
    As illustrated in Figure 1, our model includes the input sentence, word (emoji) representation, word embedding layer, Bi-LSTM layer, dropout layer, and a softmax layer. Given an input post
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

Emoji表情符号用于文本情感分析-Improving sentiment analysis accuracy with emoji embedding 的相关文章

随机推荐

  • 栈头文件C语言

    Stack ADT h 栈模型头文件 数据类型定义 typedef char Name typedef struct stack node Name name struct stack node next Stack Node typede
  • GD32450i-EVAL学习笔记 6 - ADC

    目录 1 初始化ADC 1 1 使能RCU 1 2 设置频率 1 3 设置分辨率 1 4 设置数据对齐方式 1 5 使能扫描模式 1 6 设置触发模式 1 6 使能ADC 2 初始化通道规则 3 通道使能 4 软件触发使能 5 获取ADC的
  • C#List类容输出

    将集合中的类容 添加间隔符号后 输出为string List
  • vue之babel自动埋点

    公司项目是vue单页面技术 这天组长对我说 为了凸显我们的作用 做点外行或者新手看起来高深的东西 例如自动埋点 当时我的表情先是 然后摸鱼摸习惯了 就是这种状态 最后组长说领导的态度决定我们的绩效 没办法 为了money 无奈打开了老朋友
  • nacos集群部署遇到的问题总结

    问题一 内存不足 问题描述 nacos is starting with cluster Error occurred during initialization of VM Could not reserve enough space f
  • Eclipse android apk打包

    Eclipse android 开发更改apk名字 有以下几步 第一步 修改工程包名 在eclipse里 找到项目包和java包 原则上都一样 就可以按 F2 修改名字 随之 源 java也会得到相应的修改 然而每个 java文件都需要把如
  • Fortify 代码扫描安装使用教程

    前言 Fortify 能够提供静态和动态应用程序安全测试技术 以及运行时应用程序监控和保护功能 为实现高效安全监测 Fortify具有源代码安全分析 可精准定位漏洞产生的路径 以及具有1分钟1万行的扫描速度 Fortify SCA 支持丰富
  • 现在投资创客教育

    很多人会问创客要不要投资 就像吃西红柿鸡蛋面要不要用刀叉一样 你当然可以用 也可以用筷子 格物斯坦表示如果没有筷子 喜欢用手抓着吃也是可以的啊 投资人是路径的一种 不是唯一 更不是全部 投资创客教育是教育培训行业比较热门的话题 该行业未来是
  • Python基础语法学习之变量与赋值

    近几年Python飞速发展 开始学习Python的人群不在仅仅局限于编程开发者 许多其他行业的从业者也开始将Python作为自己的职业技能 本文仍然是针对零基础的初学者 继续学习Python的基础语法 变量与赋值 主要内容包括变量和赋值的概
  • el-checkbox-group限制勾选数量

  • 代理模式:静态代理和动态代理

    代理模式 代理对象增强对目标对象的功能 分类 静态代理 动态代理 jdk代理通过实现接口 cglib通过实现类实现 静态代理 JDK动态代理 CGLIB代理模式 通过回调拦截器方法实现代理对象的生成
  • 【计算机视觉】直接线性变换(DLT)求解P矩阵(2 使用SVD分解)(附MATLAB代码)

    引言 之前的帖子已经完成了一种计算直接线性变换的方法 是直接通过矩阵运算来进行的 不过随后得到的结果并不能满足精度要求 如果只是用来作为迭代优化的一个初值的话 对于精度的要求倒也不用那么高 但在查阅资料时又发现了另一种解法 是通过SVD分解
  • DataFrame对象创建

    文章目录 前言 一 创建DataFrame对象 二 使用步骤 前言 一 创建DataFrame对象 示例 将字典 年龄 23 22 21 岗位 客服 运营 公关 年购买量 10 15 8 转换成一个DataFrame对象 二 使用步骤 代码
  • 搭建主数据管理平台,解决多个系统数据不统一的问题

    如今的电子产业环境要求电子制造商必须能对产业中的各种变化做出迅速的反应 而这些变化产生了前所未有的大量数据 包括来自产品类别增加 供应中断和需求波动以及其它因素等多方面的数据 这些数据以多种格式存在于各种系统之中 虽然 企业为管理它们花费了
  • FastAPI准备之pydantic类型检查工具

    官方文档 https pydantic docs helpmanual io 这些组织都在用 Jupyter notebook FastAPI Amazon Web Services Uber Microsoft 优势 可以和IDE结合 可
  • hbase与spark笔试题(选择题)

    转自 https www cnblogs com cxzdy p 5388451 html http www bigdatastudy net show aspx id 175 cid 14 一 HBASE笔试题 HBase来源于哪篇博文
  • CVS命令深入研究 zz

    CVS命令深入研究 作者 leizhimin 日期 2006 11 2 环境 Windows server 2003 sp1简体中文版 cvsnt 2 5 03 2260 msi 目录 一 CVS命令整体结构 二 CVS帮助察看方法概述 三
  • 浅谈我所见识的数据治理项目

    开篇一张图 与正文不一定有关 图片来源于朋友圈 01 写在前面 熟悉笔者的朋友可能知道 笔者之前做的并非纯数据相关工作 产品或项目 笔者属于半路出家的数据人 之前也几乎没有直接接触过数据仓库 数据中台 数据平台等产品或项目 与数据库是一直打
  • 大数据Mapreduce编程——矩阵乘法

    编程要求 完成矩阵乘法的 Map 函数和 Reduce 函数 1 设计两个矩阵 3050 50100 在每个单元格中填入一个 0 99 的随机数 并写入 两个文件中 作为 Map 函数的输入 2 测试运行矩阵乘法的 MapReduce 框架
  • Emoji表情符号用于文本情感分析-Improving sentiment analysis accuracy with emoji embedding

    Abstract Due to the diversity and variability of Chinese syntax and semantics accurately identifying and distinguishing