Is it possible to escape regex metacharacters reliably with sed

2023-11-21

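I'm wondering whether it is possible to write a 100% reliable sed command that escapes any regex metacharacters in an input string, so that the result can be used in a subsequent sed command. Like this: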

#!/bin/bash
# Trying to replace one regex by another in an input file with sed

search="/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3"
replace="/xyz\n\t[0-9]\+\([^ ]\)\{2,3\}\3"

# Sanitize input
search=$(sed 'script to escape' <<< "$search")
replace=$(sed 'script to escape' <<< "$replace")

# Use it in a sed command
sed "s/$search/$replace/" input

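I know that there are better tools for working with fixed strings instead of patterns, for example awk, perl or python. I would just like to prove whether it is possible with sed or not. I would say let's concentrate on basic POSIX regexes to have even more fun! :)

I have tried a lot of things, but every time I could find an input that broke my attempt. I thought that keeping it abstract as 'script to escape' would not lead anybody in the wrong direction.

By the way, the discussion came up here. I thought this could be a good place to collect solutions and probably break and/or elaborate on them.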

Note:

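If you're looking for prepackaged functionality based on the techniques discussed in this answer:
- bash functions that enable robust escaping even in multi-line substitutions can be found at the bottom of this post (plus a perl solution that uses perl's built-in support for such escaping).
- @EdMorton's answer contains a tool (bash script) that robustly performs single-line substitutions.
  - Ed's answer now has an improved version of the sed command used below, corrected in calestyo's answer, which is needed if you want to escape string literals for potential use with other regex-processing tools, such as awk and perl. In short: for cross-tool use, \ must be escaped as \\ rather than as [\], which means: instead of the
    sed 's/[^^]/[&]/g; s/\^/\\^/g' command used below, you must use
    sed 's/[^^\]/[&]/g; s/[\^]/\\&/g;'

All snippets below assume bash as the shell (POSIX-compliant reformulations are possible).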
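As an aside on the cross-tool variant quoted in the note above, here is a minimal sketch of feeding its output to awk. The sample literal and the names lit, esc and RE are hypothetical. The pattern is passed through the environment rather than with awk -v, on the assumption that -v would process backslash escapes in the value; printf pipes are used instead of here-strings so the snippet should also run under a plain POSIX sh.

```shell
lit='a\b^c.d'     # hypothetical literal containing \ ^ and .

# Cross-tool escaping: bracket everything except ^ and \, then backslash-escape those two.
esc=$(printf '%s\n' "$lit" | sed 's/[^^\]/[&]/g; s/[\^]/\\&/g;')
printf '%s\n' "$esc"

# Use the escaped text as a dynamic regex in awk; only the literal match should be printed.
printf '%s\n' 'a\b^c.d' 'a\b^cXd' | RE="$esc" awk '$0 ~ ENVIRON["RE"]'
```

The first line printed is the escaped pattern, in which the backslash and the caret are backslash-escaped and every other character sits in its own bracket expression.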

SINGLE-line Solutions

Escaping a string literal for use as a regex in `sed`:

To give credit where credit is due: I found the regex used below in this answer.

Assuming that the search string is a single-line string:

search='abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3'  # sample input containing metachars.

searchEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$search") # escape it.

sed -n "s/$searchEscaped/foo/p" <<<"$search" # Echoes 'foo'

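Every character except ^ is placed in its own character set [...] expression to treat it as a literal.
- Note that ^ is the one character you cannot represent as [^], because it has special meaning in that position (negation).
Then, ^ chars. are escaped as \^.
- Note that you cannot just escape every character by putting a \ in front of it, because that can turn a literal character into a metacharacter, e.g. \< and \b are word boundaries in some tools, \n is a newline, \{ is the start of an RE interval like \{1,3\}, etc.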
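To make the last point concrete, here is a small sketch comparing the two strategies on a harmless input. The sample value and the variable names naive and bracket are hypothetical, chosen only for illustration.

```shell
search='n'   # hypothetical sample: a plain letter, no metacharacters at all

# Naive approach: backslash-prefix every character. 'n' becomes \n, which sed reads as "newline".
naive=$(printf '%s\n' "$search" | sed 's/./\\&/g')

# Bracket approach: 'n' becomes [n], which can only mean a literal n.
bracket=$(printf '%s\n' "$search" | sed 's/[^^]/[&]/g; s/\^/\\^/g')

printf '%s\n' "$search" | sed -n "s/$naive/HIT/p"     # prints nothing: no newline in the pattern space
printf '%s\n' "$search" | sed -n "s/$bracket/HIT/p"   # prints HIT
```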

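The approach is robust, but not efficient.

The robustness comes from not trying to anticipate all special regex characters, which vary across regex dialects, and instead relying on only 2 features shared by all regex dialects:

- the ability to specify literal characters inside a character set.
- the ability to escape a literal ^ as \^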
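One way to gain some confidence in this is a round-trip check: escape a handful of awkward strings and confirm that each escaped pattern, anchored at both ends, matches exactly its own source line. The sample strings and the fail counter below are hypothetical, and the list is a spot check rather than a proof.

```shell
fail=0
while IFS= read -r s; do
  # Escape the line with the single-line command discussed above.
  esc=$(printf '%s\n' "$s" | sed 's/[^^]/[&]/g; s/\^/\\^/g')
  # The anchored, escaped pattern should match the original line in full.
  out=$(printf '%s\n' "$s" | sed -n "s/^$esc\$/OK/p")
  [ "$out" = OK ] || { printf 'FAILED: %s\n' "$s"; fail=$((fail + 1)); }
done <<'EOF'
a.b*c
[x-z]
^top$
\(grp\)\{2\}\1
a\nb & c
EOF
printf 'failures: %s\n' "$fail"
```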

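Escaping a string literal for use as the replacement string in `sed`'s `s///` command:

The replacement string in a sed s/// command is not a regex, but it recognizes placeholders that refer either to the entire string matched by the regex (&) or to specific capture-group results by index (\1, \2, ...), so these must be escaped, along with the (customary) regex delimiter, /.

Assuming that the replacement string is a single-line string: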

replace='Laurel & Hardy; PS\2' # sample input containing metachars.

replaceEscaped=$(sed 's/[&/\]/\\&/g' <<<"$replace") # escape it

sed -n "s/.*/$replaceEscaped/p" <<<"foo" # Echoes $replace as-is
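For contrast, a short sketch of what happens when the replacement is used without escaping. The variable name escaped is hypothetical, and printf pipes stand in for here-strings so the snippet is not tied to bash.

```shell
replace='Laurel & Hardy'

# Unescaped: & is replaced by the matched text, so the output is "Laurel foo Hardy".
printf 'foo\n' | sed "s/.*/$replace/"

# Escaped with the command shown above: & is now literal.
escaped=$(printf '%s\n' "$replace" | sed 's/[&/\]/\\&/g')
printf 'foo\n' | sed "s/.*/$escaped/"     # Laurel & Hardy
```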

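MULTI-line Solutions

Escaping a MULTI-LINE string literal for use as a regex in `sed`:

Note: This only makes sense if multiple input lines (possibly ALL) have been read before attempting to match.
Since tools such as sed and awk operate on a single line at a time by default, extra steps are needed to make them read more than one line at a time.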

# Define sample multi-line literal.
search='/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3
/def\n\t[A-Z]\+\([^ ]\)\{3,4\}\4'

# Escape it.
searchEscaped=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search" | tr -d '\n')           #'

# Use in a Sed command that reads ALL input lines up front.
# If ok, echoes 'foo'
sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p" <<<"$search"

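The newlines in multi-line input strings must be translated to '\n' strings, which is how newlines are encoded in a regex.
- $!a\'$'\n''\\n' appends the string '\n' to every output line but the last (the last newline is ignored, because it was added by <<<).
- tr -d '\n' then removes all actual newlines from the string (sed adds one whenever it prints its pattern space), effectively replacing all newlines in the input with '\n' strings.

-e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop, therefore leaving subsequent commands to operate on all input lines at once.
- If you're using GNU sed (only), you can use its -z option to simplify reading all input lines at once:
  sed -z "s/$searchEscaped/foo/" <<<"$search"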
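To see the intermediate result of the newline translation, here is a sketch on a short two-line literal. The sample value and the name esc are hypothetical. The a command is split across two -e options, which is assumed to be equivalent to the $'\n' spelling used above while avoiding bash-only quoting.

```shell
# Hypothetical two-line literal with metacharacters.
search='a.b
c*d'

# Bracket-escape each line, append the two-character string \n after every line
# but the last, then delete the real newlines.
esc=$(printf '%s\n' "$search" |
  sed -e 's/[^^]/[&]/g; s/\^/\\^/g' -e '$!a\' -e '\\n' | tr -d '\n')
printf '%s\n' "$esc"     # [a][.][b]\n[c][*][d]

# Read all input lines first, then match across the line boundary.
printf '%s\n' "$search" | sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$esc/foo/p"   # foo
```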

Escaping a MULTI-LINE string literal for use as the replacement string in `sed`'s `s///` command:

# Define sample multi-line literal.
replace='Laurel & Hardy; PS\2
Masters\1 & Johnson\2'

# Escape it for use as a Sed replacement string.
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$replace")
replaceEscaped=${REPLY%$'\n'}

# If ok, outputs $replace as is.
sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar"

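The newlines in the input string must be retained as actual newlines, but \-escaped.
- -e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop.
- 's/[&/\]/\\&/g escapes all &, \ and / instances, as in the single-line solution.
- s/\n/\\&/g' then \-prefixes all actual newlines.
- IFS= read -d '' -r is used to read the sed command's output as is (to avoid the automatic removal of trailing newlines that a command substitution ($(...)) would perform).
- ${REPLY%$'\n'} then removes a single trailing newline, which the <<< has implicitly appended to the input.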
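When the replacement text is known not to end in newlines, a plain command substitution is arguably enough, since the trailing newline it strips is only the one sed adds on output. The following sketch makes that assumption; the sample value and the name esc are hypothetical.

```shell
# Hypothetical two-line replacement with & and \1, and no trailing newline.
replace='You & I
A\1 sauce'

# Slurp all lines, escape & / \, then backslash-escape each embedded newline.
esc=$(printf '%s\n' "$replace" |
  sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g')

# Should print the two lines of $replace unchanged.
printf 'foo\n' | sed "s/.*/$esc/"
```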

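`bash` functions based on the above (for `sed`):

- quoteRe() quotes (escapes) for use in a regex
- quoteSubst() quotes for use in the substitution string of a s/// call.
- both handle multi-line input correctly
  - Note that because sed reads a single line at a time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
  - Also, using command substitutions ($(...)) to call the functions won't work for strings that have trailing newlines; in that event, use something like IFS= read -d '' -r escapedValue < <(quoteSubst "$value")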
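The trailing-newline caveat can be seen in isolation with the sketch below. It uses a sentinel character as a portable alternative to read -d ''; that trick is not part of the answer above, and the names value, plain and kept are hypothetical.

```shell
# Hypothetical value that ends in exactly one newline.
value='line one
'

# Plain command substitution: the trailing newline is silently dropped.
plain=$(printf '%s' "$value")

# Sentinel trick: emit an extra 'x' so the newline is no longer trailing, then strip the 'x'.
kept=$(printf '%sx' "$value")
kept=${kept%x}

[ "$plain" = "$value" ] && echo 'plain: preserved' || echo 'plain: newline lost'
[ "$kept" = "$value" ] && echo 'kept: preserved' || echo 'kept: newline lost'
```

The same wrapper could presumably be put around a call to quoteSubst by having the command substitution print the sentinel after the function's output.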

# SYNOPSIS
#   quoteRe <text>
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }

# SYNOPSIS
#  quoteSubst <text>
quoteSubst() {
  IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
  printf %s "${REPLY%$'\n'}"
}

Example:

from=$'Cost\(*):\n$3.' # sample input containing metachars. 
to='You & I'$'\n''eating A\1 sauce.' # sample replacement string with metachars.

# Should print the unmodified value of $to
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/$(quoteRe "$from")/$(quoteSubst "$to")/" <<<"$from"

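Note the use of -e ':a' -e '$!{N;ba' -e '}' to read all input at once, so that the multi-line substitution works.

`perl` solution:

Perl has built-in support for escaping arbitrary strings for literal use in a regex: the quotemeta() function or its equivalent \Q...\E quoting.
The approach is the same for both single- and multi-line strings; for example: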

from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You owe me $1/$& for'$'\n''eating A\1 sauce.' # sample replacement string w/ metachars.

# Should print the unmodified value of $to.
# Note that the replacement value needs NO escaping.
perl -s -0777 -pe 's/\Q$from\E/$to/' -- -from="$from" -to="$to" <<<"$from"

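Note the use of -0777 to read all input at once, so that the multi-line substitution works.
The -s option allows placing -<var>=<val>-style Perl variable definitions following -- after the script, before any filename operands.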

