Unicode flag not working with RegEx in JavaScript

2024-03-05

My code fails to detect the operators when non-English characters are used:

const OPERATOR_REGEX = new RegExp(
  /(?!\B"[^"|“|”]*)\b(and|or|not|exclude)(?=.*[\s])\b(?![^"|“|”]*"\B)/,
  'giu'
);

const query1 = '(Java or "化粧" or 化粧品)';
const query2 = '(Java or 化粧 or 化粧品)';

console.log(query1.split(OPERATOR_REGEX));
console.log(query2.split(OPERATOR_REGEX));

https://codepen.io/thewebtud/pen/vYraavd?editors=1111

However, the same pattern with the unicode flag successfully detects all operators on regex101.com: https://regex101.com/r/FC84BH/1

How can this be fixed in JS?


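Remember that

  • \b (a word boundary) can be written as (?:(?<=^)(?=\w)|(?<=\w)(?=$)|(?<=\W)(?=\w)|(?<=\w)(?=\W)), and
  • \B (a non-word boundary) can be written as (?:(?<=^)(?=\W)|(?<=\W)(?=$)|(?<=\W)(?=\W)|(?<=\w)(?=\w)),

and that a Unicode-aware \w pattern is [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] (see "Replace certain Arabic words in text string using Javascript", https://stackoverflow.com/a/66680311/3832970). In JavaScript, the u flag does not make \w, \b, or \B Unicode-aware, so the boundaries have to be spelled out with lookarounds. Here is an ECMAScript 2018+ solution: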

const w = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
const nw = String.raw`[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
const uwb = String.raw`(?:(?<=^)(?=${w})|(?<=${w})(?=$)|(?<=${nw})(?=${w})|(?<=${w})(?=${nw}))`;
const unwb = String.raw`(?:(?<=^)(?=${nw})|(?<=${nw})(?=$)|(?<=${nw})(?=${nw})|(?<=${w})(?=${w}))`;

const OPERATOR_REGEX = new RegExp(
  String.raw`(?!${unwb}"[^"“”]*)${uwb}(and|or|not|exclude)(?=.*\s)${uwb}(?![^"“”]*"${unwb})`,
  'giu'
);

const query1 = '(Java or "化粧" or 化粧品)';
const query2 = '(Java or 化粧 or 化粧品)';

console.log(query1.split(OPERATOR_REGEX));
console.log(query2.split(OPERATOR_REGEX));
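
To see why the original pattern behaves differently in JavaScript, the following minimal sketch (not part of the original answer; all variable names are illustrative) checks how \w and \B treat the character 化 under the u flag, and compares that with the Unicode-aware character class used above. It assumes an engine with ES2018 support for lookbehind and Unicode property escapes.

```javascript
// Unicode-aware "word character" class, copied from the solution above.
const w = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;

const kanji = '化';

// Built-in \w does not treat 化 as a word character, even with the `u` flag.
const asciiSeesWord = /\w/u.test(kanji);

// The property-based class does treat 化 as a word character.
const unicodeSeesWord = new RegExp(w, 'u').test(kanji);

// Consequently, built-in \B matches between `"` and 化 (both sides are
// "non-word"), which is what trips the quote-detecting lookaheads.
const builtinNonBoundary = /"\B/u.test('"化');

// A Unicode-aware non-boundary: both neighbours are word characters,
// or neither is. Between `"` and 化 this does not match.
const unicodeNonBoundary = new RegExp(
  String.raw`"(?:(?<=${w})(?=${w})|(?<!${w})(?!${w}))`,
  'u'
).test('"化');

console.log({ asciiSeesWord, unicodeSeesWord, builtinNonBoundary, unicodeNonBoundary });
// { asciiSeesWord: false, unicodeSeesWord: true,
//   builtinNonBoundary: true, unicodeNonBoundary: false }
```

The last check writes the non-boundary with negative lookarounds, a shorter but equivalent form of the four-way alternation shown in the answer; either form can be substituted for \B.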