Remove punctuation but keep emoticons?

2023-11-22

Is it possible to remove all punctuation but keep emoticons, such as

:-(

structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

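1. A working, purely regex-based solution (a.k.a. Edit#2)

This task can be done purely with regular expressions (many thanks to @Mike Samuel).

First, we build a database of emoticons:

library(stringi)
(emots <- as.character(outer(c(":", ";", ":-", ";-"),
               c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))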
## [1] ":)"  ";)"  ":-)" ";-)" ":("  ";("  ":-(" ";-(" ":]"  ";]"  ":-]" ";-]" ":["  ";["  ":-[" ";-[" ":D"  ";D"  ":-D" ";-D"
## [21] ":o"  ";o"  ":-o" ";-o" ":O"  ";O"  ":-O" ";-O" ":P"  ";P"  ":-P" ";-P" ":p"  ";p"  ":-p" ";-p"

Sample input text:

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they can be used in a regex pattern (using the stringi package):

library(stringi)
escape_regex <- function(r) {
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}
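The helper above escapes only round and square brackets, which is enough for the emoticons listed here. If the database later grows to include emoticons with other regex metacharacters, one possible alternative is to quote each emoticon literally with ICU's \Q...\E. This is a minimal sketch and not part of the original answer; the extra emoticons and the names more_emots, regex_q and cleaned are made up for illustration.

```r
library(stringi)

# Illustrative emoticons (an assumption, not from the original answer);
# ":*" and ":|" contain metacharacters that a bracket-only escape misses.
more_emots <- c(":)", "X-(", ":*", ":|")

# Quote every emoticon literally with ICU's \Q...\E instead of escaping
# individual characters.
quoted <- stri_c("\\Q", more_emots, "\\E")
regex_q <- stri_c("(", stri_c(quoted, collapse = "|"), ")")

# Same replacement idea as in the answer: keep group 1, drop punctuation.
cleaned <- stri_replace_all_regex("Queue no. 42?! X-( ok :* fine, :| bye.",
                                  stri_c(regex_q, "|\\p{P}"), "$1")
print(cleaned)
```

Quoting this way would break if an emoticon itself contained the sequence \E, so treat it as a convenience rather than a general escaping routine.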

A regex that matches the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

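Now, as @Mike Samuel suggested below, we simply match (emoticon)|punctuation (note that the emoticon is in a capturing group) and replace each match with the contents of capture group 1. So if the match is an emoticon, the replacement is that emoticon; if it is a punctuation mark, the replacement is nothing. This works because alternation with | in ICU regex (the engine behind stri_replace_all_regex) is eager and left-biased: alternatives are tried from left to right, so an emoticon is matched before any of its characters can be matched as punctuation.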

stri_replace_all_regex(text, stri_c(regex1, "|\\p{P}"), "$1")
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"
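To see why the order of the alternatives matters, here is a small sketch (the input string and variable names are made up for illustration) that runs the same replacement twice, once with the emoticon alternative first and once with the punctuation class first:

```r
library(stringi)

x <- "Great :-) --- really!"

# Emoticon alternative first: ":-)" is consumed as a whole and kept via $1.
emot_first <- stri_replace_all_regex(x, "(:-\\))|\\p{P}", "$1")

# Punctuation first: ":" already matches \p{P}, so the emoticon alternative
# never gets a chance and ":-)" is removed character by character.
punct_first <- stri_replace_all_regex(x, "\\p{P}|(:-\\))", "$1")

print(emot_first)
print(punct_first)
```

Only the first variant keeps the emoticon, which is why regex1 has to come before \\p{P} in the pattern.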

By the way, if you only want to remove a selected set of characters, use e.g. [.,] instead of \\p{P} above.
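The pieces above can be wrapped into one function and applied to a whole column, such as the text factor in the question. The following is a sketch under assumptions: the function name strip_punct_keep_emots and its punct argument are hypothetical, and the two sample tweets are shortened stand-ins for the real data.

```r
library(stringi)

# Hypothetical helper (not from the original answer): remove punctuation from
# each element of x, keeping every emoticon listed in emots.
strip_punct_keep_emots <- function(x, emots, punct = "\\p{P}") {
  # Escape brackets, as escape_regex() does in the answer.
  escaped <- stri_replace_all_regex(emots, "\\(|\\)|\\[|\\]", "\\\\$0")
  pattern <- stri_c("(", stri_c(escaped, collapse = "|"), ")|", punct)
  # as.character() so that factor columns are handled too.
  stri_replace_all_regex(as.character(x), pattern, "$1")
}

emots <- as.character(outer(c(":", ";", ":-", ";-"),
                            c(")", "(", "]", "[", "D", "o", "O", "P", "p"),
                            stri_paste))

tweets <- factor(c("Kudos, team :)", "Fix it? (soon)"))
all_punct  <- strip_punct_keep_emots(tweets, emots)          # all punctuation
some_punct <- strip_punct_keep_emots(tweets, emots, "[.,]")  # only . and ,
print(all_punct)
print(some_punct)
```

Note that \\p{P} also covers @ and #, so mentions and hashtags lose their prefix unless a narrower class is passed as punct.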

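2. Hints for a regex solution - my first (ill-advised) attempt (a.k.a. the original answer)

My first idea (left here mostly for "historical reasons") was to tackle this problem with lookaheads and lookbehinds but, as you can see, this is far from perfect. Note that the outputs shown in this section seem to come from an earlier version of the sample text (one containing :8 and no brackets or braces), so they differ slightly from what the text defined above produces.

Remove every : and ; that is not followed by ), P, (, D, X, 8, [, or ], using a negative lookahead: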

stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"

Now we can add some old-school emoticons (with noses, e.g. :-), ;-D, etc.):

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) --- and the salesperson said Oh boy!"

Now remove the hyphens (negative lookbehind and negative lookahead):

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])|(?<![:;])[-](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-)  and the salesperson said Oh boy!"

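And so on. Of course, you should first build your own databases of emoticons (to keep) and punctuation marks (to remove). The regex depends heavily on these two sets, so adding new emoticons is hard; this approach is definitely not worth using (and may warp your brain).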

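3. Second attempt (regex-dummy friendly, a.k.a. Edit#1)

On the other hand, if you are allergic to complex regexes, try this. This approach has some "didactic benefits": we fully understand what is done in each of the following steps:

  1. Find all emoticons in text;
  2. Find all punctuation marks in text;
  3. Find the positions of the punctuation marks that are not part of an emoticon;
  4. Remove the characters found in step 3.

Sample input text - just 1 string; the general case is left as an exercise ;)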

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they can be used in a regex:

library("stringi")
escape_regex <- function(r) {
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

A regex that matches the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

Find the start and end positions of all emoticons (i.e. locate the first OR the second OR ... emoticon):

where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
print(where_emots)
##       start end
##  [1,]     1   2
##  [2,]     4   5
##  [3,]     7   8
##  [4,]    10  11
##  [5,]    13  14
##  [6,]    16  17
##  [7,]    23  24
##  [8,]    64  65
##  [9,]    67  69

Find all punctuation marks (here \\p{P} is the Unicode character class for punctuation):

where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
print(where_punct)
##       start end
##  [1,]     1   1
##  [2,]     2   2
##  [3,]     4   4
##  [4,]     7   7
##  [5,]     8   8
## ...
## [26,]    72  72
## [27,]    73  73
## [28,]    99  99
## [29,]   107 107

Since some punctuation marks occur inside emoticons, we should not stage those for removal:

which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
   any(where_punct[i,1] >= where_emots[,1] &
        where_punct[i,2] <= where_emots[,2]) })
where_punct <- where_punct[!which_punct_omit,] # update where_punct
print(where_punct)
##       start end
##  [1,]    27  27
##  [2,]    38  38
##  [3,]    39  39
##  [4,]    40  40
##  [5,]    46  46
##  [6,]    54  54
##  [7,]    58  58
##  [8,]    60  60
##  [9,]    71  71
## [10,]    72  72
## [11,]    73  73
## [12,]    99  99
## [13,]   107 107

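Every punctuation mark is exactly 1 character long, so where_punct[,1]==where_punct[,2] always holds.

Now the final part. As you can see, where_punct[,1] contains the positions of the characters to remove. IMHO the simplest way to do that (without loops) is to convert the string to UTF-32 (each character == 1 integer), remove the unwanted elements, and convert back to a textual representation: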
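Because each punctuation match is a single character, the filtering step can also be done without sapply() by collecting every position covered by an emoticon and testing membership. A minimal sketch on a shorter, made-up string (the names covered and to_remove are illustrative); omit_no_match = TRUE makes stri_locate_all_regex return a zero-row matrix instead of an NA row when nothing matches:

```r
library(stringi)

text <- "ok :-) --- done!"

where_emots <- stri_locate_all_regex(text, ":-\\)", omit_no_match = TRUE)[[1]]
where_punct <- stri_locate_all_regex(text, "\\p{P}", omit_no_match = TRUE)[[1]]

# All character positions that lie inside some emoticon.
covered <- unlist(Map(seq, where_emots[, 1], where_emots[, 2]))

# Keep only punctuation outside emoticons; drop = FALSE keeps the result a
# matrix even when a single row is left.
to_remove <- where_punct[!(where_punct[, 1] %in% covered), , drop = FALSE]
print(to_remove)
```

Here the emoticon covers positions 4 to 6, so only the three hyphens and the exclamation mark remain staged for removal.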

text_tmp <- stri_enc_toutf32(text)[[1]]
print(text_tmp) # here - just ASCII codes...
## [1]  58  41  32  59  80  32  58  93  32  58....
text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!

The result is:

stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"
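The comment about where_punct not being empty matters because negative indexing with an empty vector, x[-integer(0)], selects nothing instead of everything. One way around it, sketched here with a hypothetical helper named drop_positions, is to build a logical mask of the positions to keep:

```r
library(stringi)

# Hypothetical helper (not from the original answer): remove the characters at
# the given positions from a single string.
drop_positions <- function(text, pos) {
  cp <- stri_enc_toutf32(text)[[1]]   # one integer code point per character
  keep <- !(seq_along(cp) %in% pos)   # all TRUE when pos is empty
  stri_enc_fromutf32(cp[keep])
}

with_removal    <- drop_positions("a-b!", c(2L, 4L))
without_removal <- drop_positions("abc", integer(0))
print(with_removal)
print(without_removal)
```

With the mask, a string that has no punctuation outside emoticons comes back unchanged rather than empty.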

Here you go.
