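As part of a text classification problem, I am trying to clean a text dataset. Until now I removed everything except the text: punctuation, digits, emojis, all of it was stripped. Now I want to use emojis as features, so I need to keep both words and emojis.
First, I search for emojis in the text and separate them from the neighbouring words and emojis, because each emoji should be treated as an individual token. To do that, I find each emoji and pad it with a space on both sides.
But I am at a loss as to how to combine the existing word regex with the emoji regex. Here is my current code: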
import re

def clean_text(raw_text):
    padded_emoji_text = pad_emojis(raw_text)
    print("Emoji padded text: " + padded_emoji_text)
    reg = re.compile("[^a-zA-Z]")  # line a
    # old regex to remove everything except words
    letters_only_text = reg.sub(' ', raw_text)
    print("Cleaned text: " + letters_only_text)
    # Code to remove everything except text and emojis
    # How?

def pad_emojis(raw_text):
    print("Original Text: " + raw_text)
    reg = re.compile(u'['
                     u'\U0001F300-\U0001F64F'
                     u'\U0001F680-\U0001F6FF'
                     u'\u2600-\u26FF\u2700-\u27BF]',
                     re.UNICODE)
    # padding the emoji with space at both ends
    new_text = reg.sub(r' \g<0> ', raw_text)
    return new_text

text = "I am very #happy man! but???????? my wife???? is not ????????. 99/33"
clean_text(text)
Current output:
Original Text: I am very #happy man! but???????? my wife???? is not ????????. 99/33
Emoji padded text: I am very #happy man! but ???? ???? my wife ???? is not ???? ???? . 99/33
Cleaned text: I am very happy man but my wife is not
What I am trying to achieve:
I am very happy man but ???? ???? my wife ???? is not ???? ????
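Questions:
1) How do I add the emoji regex alongside the word regex in the regex compile? (line a)
2) Is there a better way to achieve what I am after, one that does not require a separate function to isolate the emojis and pad them with spaces? I somehow feel this can be avoided.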
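To make both questions concrete, here is a minimal sketch of the two directions I have in mind, assuming Python 3 and the same (non-exhaustive) emoji ranges as in pad_emojis. The names EMOJI_RANGES, KEEP_RE, TOKEN_RE, strip_non_tokens and clean_text_combined are made up for illustration. The first function puts the emoji ranges inside the negated character class from line a; the second matches words and emojis directly, which seems to make the padding step unnecessary.

```python
import re

# Same ranges as in pad_emojis; they do not cover every emoji.
EMOJI_RANGES = (
    '\U0001F300-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF'
)

# Question 1: a negated class that matches anything that is
# neither an ASCII letter nor one of the emojis above.
KEEP_RE = re.compile('[^a-zA-Z' + EMOJI_RANGES + ']')

def strip_non_tokens(text):
    # Replaces every other character with a space. Emojis that touch
    # a word or each other stay glued together, so padding is still needed.
    return KEEP_RE.sub(' ', text)

# Question 2: match the tokens themselves, either a run of
# ASCII letters or exactly one emoji.
TOKEN_RE = re.compile('[a-zA-Z]+|[' + EMOJI_RANGES + ']')

def clean_text_combined(raw_text):
    # findall returns words and emojis in order of appearance;
    # joining with a single space separates adjacent emojis as well.
    return ' '.join(TOKEN_RE.findall(raw_text))

# Prints "so happy" followed by the two emojis, each separated by one space.
print(clean_text_combined("so #happy!\U0001F600\U0001F600 99/33"))
```

With the second version, punctuation, digits and extra whitespace disappear in the same pass, because only the matched tokens are kept.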