What's the real difference between a token and a rule?

2023-12-20

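I was drawn to Raku by its built-in grammars and figured I'd play around with it and write a simple email address parser. The only problem: I couldn't get it to work.

I tried countless iterations before landing on something that actually works, and I'm struggling to understand why.

All it boiled down to was changing token to rule.

Here's my example code: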

grammar Email {
  token TOP { <name> '@' [<subdomain> '.']* <domain> '.' <tld> }  
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse('foo.bar@baz.example.com');

doesn't work and just prints Nil, but

grammar Email {
  rule TOP { <name> '@' [<subdomain> '.']* <domain> '.' <tld> }  
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse('foo.bar@baz.example.com');

does work and correctly prints

｢foo.bar@baz.example.com｣
 name => ｢foo.bar｣
 subdomain => ｢baz｣
 domain => ｢example｣
 tld => ｢com｣

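And all I changed was token TOP to rule TOP.

From what I can gather from the documentation, the only difference between those two keywords is that whitespace is significant in rule but not in token. If that's true, the first example should work, since I want to ignore the whitespace between the individual pieces of the pattern.

Removing the spaces between the pieces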

rule TOP { <name>'@'[<subdomain>'.']*<domain>'.'<tld> }

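reverts the behaviour back to printing Nil.

Anyone able to clue me in on what's happening here?

EDIT: Changing the TOP rule to a regex instead, which allows for backtracking, makes it work too.

The question still remains: how come rule { } (which is the same as regex {:ratchet :sigspace }) matches when token { } (which is the same as regex {:ratchet }) doesn't?

The email address doesn't have any spaces in it, so for all intents and purposes it should fail right away.

This answer explains the problem, provides a simple solution, and then goes deep.

The problem with your grammar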

First, your SO demonstrates what seems to be either an extraordinary bug or a common misunderstanding. See JJ's answer for the issue he's filed to follow up on it, and/or my footnote.⁴

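Putting the bug/"bug" aside, your grammar directs Raku to not match your input:

- The [<subdomain> '.']* atom eagerly consumes the string 'baz.example.' from your input;
- The remaining input ('com') fails to match the remaining atoms (<domain> '.' <tld>);
- The :ratchet https://docs.raku.org/language/regexes#index-entry-regex_adverb_:ratchet-regex_adverb_:r-Ratchet that's in effect for tokens means the grammar engine does not backtrack into the [<subdomain> '.']* atom.

Thus the overall match fails.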
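The ratchet effect can be reproduced in isolation. Here's a minimal sketch that applies just the host part of the pattern to 'baz.example.com', once as a token and once as a regex. The ^ and $ anchors and the variable names are my additions for illustration; they aren't part of the grammar above.

```raku
my $input = 'baz.example.com';

# token implies :ratchet: once [\w+ '.']* has consumed 'baz.example.',
# the engine never goes back into it, so the trailing \w+ '.' \w+ cannot match.
my $ratcheted = $input ~~ token { ^ [\w+ '.']* \w+ '.' \w+ $ };

# regex allows backtracking: the group gives back one repetition,
# leaving 'example.com' for the trailing atoms.
my $backtracking = $input ~~ regex { ^ [\w+ '.']* \w+ '.' \w+ $ };

say so $ratcheted;     # False
say so $backtracking;  # True
```

The anchors stop the match from succeeding at a later start position, so the only difference between the two results is whether the quantified group may backtrack.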

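The simplest solution

The simplest solution to make your grammar work is to append ! to the [<subdomain> '.']* pattern in your token.

This has the following effect:

- If any of the remainder of the token fails (after the subdomain atom), the grammar engine will backtrack to the subdomain atom, drop the last of its match repetitions, and then try to move forward again;
- If matching fails again, the engine will again backtrack to the subdomain atom, drop another repetition, and try again;
- The grammar engine will repeat the above actions until either the rest of the token matches or there are no matches of the [<subdomain> '.'] atom left to backtrack on.

Note that adding the ! to the subdomain atom means that the backtracking behaviour is limited to just the subdomain atom; if the domain atom matched, but then the tld atom did not, the token would fail instead of trying to backtrack. This is because the whole point of tokens is that, by default, they don't backtrack to earlier atoms after they've succeeded.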

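Playing with Raku, developing grammars, and debugging

Nil is fine as a response from a grammar that is known (or thought) to work fine, when you don't want any more useful response in the event of a parse failure.

For any other scenario there are much better options, as summarized in my answer to How can error reporting in grammars be improved? https://stackoverflow.com/questions/19618287/how-can-error-reporting-in-grammars-be-improved/19640657#19640657.

In particular, for playing around, developing a grammar, or debugging one, the best option by far is to install the free Comma and use its Grammar Live View https://commaide.com/docs/grammar-live-view feature.

Fixing your grammar; general strategies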

Your grammar suggests ~~two~~ three options¹:

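- Parse forwards with some backtracking. (The simplest solution.)
- Parse backwards. Write the pattern in reverse, and reverse the input and output.
- Post parse the parse.

Parse forwards with some backtracking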

Backtracking is a reasonable approach for parsing some patterns. But it is best minimized, to maximize performance, and even then still carries DoS risks if written carelessly.²

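To switch on backtracking for an entire token, just switch the declarator to regex instead. A regex is just like a token, except that it specifically enables backtracking like a traditional regex.

Another option is to stick with token and limit the part of the pattern that might backtrack. One way to do that is to append a ! after an atom to let it backtrack, explicitly overriding the token's overall "ratchet" that would otherwise kick in when that atom succeeds and matching moves on to the next atom: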

token TOP { <name> '@' [<subdomain> '.']*! <domain> '.' <tld> }
                                         ^

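An alternative to ! is to insert :!ratchet to switch "ratcheting" off for a part of a rule, and then :ratchet to switch ratcheting back on again, e.g.: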

token TOP { <name> '@' :!ratchet [<subdomain> '.']* :ratchet <domain> '.' <tld> }

(You can also use r as an abbreviation for ratchet, i.e. :!r and :r.)
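For reference, here's the question's grammar with the *! change dropped in, as a complete sketch. The variable name $match is mine; everything else comes from the grammar in the question.

```raku
grammar Email {
  # *! re-enables backtracking for this one quantified group only;
  # the rest of TOP still ratchets as usual for a token.
  token TOP { <name> '@' [<subdomain> '.']*! <domain> '.' <tld> }
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}

my $match = Email.parse('foo.bar@baz.example.com');
say $match;  # a populated match object rather than Nil
```

The group first consumes 'baz.example.', the remaining atoms fail on 'com', and the engine then drops one repetition so that <domain> can take 'example' and <tld> can take 'com'.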

Parse backwards

A classic parsing trick that works for some scenarios is to parse backwards as a way to avoid backtracking.

grammar Email {
  token TOP { <tld> '.' <domain> ['.' <subdomain> ]* '@' <name> }  
  token name { \w+ ['.' \w+]* }
  token domain { \w+ }
  token subdomain { \w+ }
  token tld { \w+ }
}
say Email.parse(flip 'foo.bar@baz.example.com').hash>>.flip;
#{domain => example, name => foo.bar, subdomain => [baz], tld => com}

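Probably too complex for most folks' needs, but I thought I'd include it in my answer.

Post parse the parse

In the above I presented a solution that introduces some backtracking, and another that avoids it but with significant costs in terms of ugliness, cognitive load etc. (parsing backwards?!?).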

There's another very important technique that I overlooked until reminded by JJ's answer.¹ Just parse the results of the parse.

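Here's one way. I've completely restructured the grammar, partly to make more sense of this way of doing things, and partly to demonstrate some Raku grammar features: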

grammar Email {
  token TOP {
              <dotted-parts(1)> '@'
    $<host> = <dotted-parts(2)>
  }
  token dotted-parts(\min) { <parts> ** {min..*} % '.' }
  token parts { \w+ }
}
say Email.parse('foo.bar@baz.buz.example.com')<host><parts>;

displays:

[｢baz｣ ｢buz｣ ｢example｣ ｢com｣]

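While this grammar matches the same strings as yours, and post-parses like JJ's, it's obviously very different:

- The grammar is reduced to three tokens.
- The TOP token makes two calls to a generic dotted-parts token, with an argument specifying the minimum number of parts.
- $<host> = ... captures the following atom under the name <host>.

  (This is typically redundant if the atom is itself a named pattern, as it is in this case -- <dotted-parts>. But "dotted-parts" is rather generic; and to refer to the second match of it (the first comes before the @), we'd need to write <dotted-parts>[1]. So I've tidied up by naming it <host>.)
- The dotted-parts pattern may look a bit challenging but it's actually pretty simple:
  - It uses a quantifier clause (** {min..max} https://docs.raku.org/language/regexes#index-entry-regex_quantifier_**-General_quantifier:_**_min..max) to express any number of parts provided it's at least the minimum.
  - It uses a modifier clause (% <separator> https://docs.raku.org/language/regexes#index-entry-regex_%24PERCENT_SIGN-regex_%24PERCENT_SIGN%24PERCENT_SIGN-Modified_quantifier:_%24PERCENT_SIGN,_%24PERCENT_SIGN%24PERCENT_SIGN) which says there must be a dot between each part.
- <host><parts> extracts from the parse tree the captured data associated with the parts token of the second use in the TOP rule of dotted-parts. Which is an array: [｢baz｣ ｢buz｣ ｢example｣ ｢com｣].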
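To see the quantifier and separator clauses on their own, here's a small sketch that reuses the dotted-parts and parts tokens in a throwaway grammar. The grammar name Dotted and its TOP token are hypothetical, made up just for this illustration.

```raku
grammar Dotted {
  # Require at least two dot-separated parts.
  token TOP { <dotted-parts(2)> }
  token dotted-parts(\min) { <parts> ** {min..*} % '.' }
  token parts { \w+ }
}

say so Dotted.parse('example.com');  # True
say so Dotted.parse('com');          # False

# Each part is captured separately; the dots are separators only.
say Dotted.parse('baz.buz.example.com')<dotted-parts><parts>.elems;  # 4
```

No backtracking is needed here: the token greedily takes every part, and the split into subdomains, domain and tld is left for after the match.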

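Sometimes one wants some or all of the reparsing to happen during the parsing, so that reparsed results are ready when a call to .parse completes.

JJ has shown one way to code what are called actions. This involved:

- Creating an "actions" class containing methods whose names correspond to named rules in the grammar;
- Telling the parse method to use that actions class;
- If a rule succeeds, then the action method with the corresponding name is called (while the rule remains on the call stack);
- The match object corresponding to the rule is passed to the action method;
- The action method can do whatever it likes, including reparsing what just got matched.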
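The steps above can be sketched as follows. This is not JJ's code; it's a minimal illustration that attaches an actions class to the restructured grammar from this answer. The class name EmailActions and the variable names are hypothetical.

```raku
grammar Email {
  token TOP { <dotted-parts(1)> '@' $<host> = <dotted-parts(2)> }
  token dotted-parts(\min) { <parts> ** {min..*} % '.' }
  token parts { \w+ }
}

class EmailActions {
  # Called with the match object when the TOP rule succeeds.
  method TOP($/) {
    my $parts = $<host><parts>;
    make (subs => $parts[0 .. *-3],
          dom  => $parts[*-2],
          tld  => $parts[*-1]);
  }
}

# Tell .parse which actions class to use, then read back what was made.
my $match = Email.parse('foo.bar@baz.buz.example.com', actions => EmailActions);
.say for $match.made;
```

Only TOP has an action method here; rules without a matching method name are simply parsed as usual.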

It's simpler, and sometimes better, to write actions directly inline:

grammar Email {
  token TOP {
              <dotted-parts(1)> '@'
    $<host> = <dotted-parts(2)>

    # The new bit:
    {
      make (subs => .[ 0 .. *-3 ],
            dom  => .[      *-2 ],
            tld  => .[      *-1 ])

      given $<host><parts>
    }

  }
  token dotted-parts(\min) { <parts> ** {min..*} % '.' }
  token parts { \w+ }
}
.say for Email.parse('foo.bar@baz.buz.example.com') .made;

displays:

subs => (｢baz｣ ｢buz｣)
dom => ｢example｣
tld => ｢com｣

Notes:

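- I've directly inlined the code doing the reparsing.

  (One can insert arbitrary code blocks ({...}) anywhere one could otherwise insert an atom. In the days before we had grammar debuggers, a classic use case was { say $/ }, which prints $/, the match object, as it is at the point the code block appears.)
- If a code block is put at the end of a rule, as I have done, it is almost equivalent to an action method.

  (It will be called when the rule has otherwise completed, and $/ is already fully populated. In some scenarios inlining an anonymous action block is the way to go. In others, breaking it out into a named method in an action class like JJ did is better.)
- make is a major use case for action code.

  (All make does is store its argument in the .made attribute of $/, which in this context is the current parse tree node. Results stored by make are automatically thrown away if backtracking subsequently throws away the enclosing parse node. Often that's precisely what one wants.)
- foo => bar forms a Pair https://docs.raku.org/type/Pair.
- The postcircumfix [...] operator https://docs.raku.org/routine/%5B%20%5D#(Operators)_postcircumfix_%5B_%5D indexes its invocant:
  - In this case there's just a prefix . without an explicit LHS, so the invocant is "it". The "it" was set up by the given, i.e. it is (excuse the pun) $<host><parts>.
  - The * in the index *-n is the invocant's length; so [ 0 .. *-3 ] is all but the last two elements of $<host><parts>.
- The .say for ... line ends in .made³, to pick up the make'd value.
- The make'd value is a list of three pairs breaking out $<host><parts>.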
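The topic and indexing notes above can be tried without any grammar at all. In this sketch a plain array of strings (my stand-in, with a made-up variable name) plays the role of $<host><parts>:

```raku
my @parts = <baz buz example com>;

# given sets the topic, so a bare .[...] indexes @parts.
given @parts {
  say .[0 .. *-3];  # (baz buz)
  say .[*-2];       # example
  say .[*-1];       # com
}
```

Inside the index, * stands for the number of elements, so *-1 is the last element and 0 .. *-3 stops two short of the end.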

Footnotes

¹ I had truly thought my first two options were the two main ones available. It's been around 30 years since I encountered Tim Toady online. You'd think by now I'd have learned by heart his eponymous aphorism -- There Is More Than One Way To Do It!

² Beware "pathological backtracking" https://www.google.com/search?q=%22pathological+backtracking%22. In a production context, if you have suitable control of your input, or the system your program runs on, you may not have to worry about deliberate or accidental DoS attacks because they either can't happen, or will uselessly take down a system that's rebootable in the event of being rendered unavailable. But if you do need to worry, i.e. the parsing is running on a box that needs to be protected from a DoS attack, then an assessment of the threat is prudent. (Read Details of the Cloudflare outage on July 2, 2019 https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/ to get a real sense of what can go wrong.) If you are running Raku parsing code in such a demanding production environment then you would want to start an audit of code by searching for patterns that use regex, /.../ (the ... are metasyntax), :!r (to include :!ratchet), or *!.

³ There's an alias for .made; it's .ast. I think it stands for A Sparse Tree or Annotated Subset Tree and there's a cs.stackexchange.com question https://cs.stackexchange.com/questions/95759/is-sparse-subtree-an-appropriate-term-for-what-i-describe-in-this-question that agrees with me.

⁴ Golfing your problem, this seems wrong:

say 'a' ~~ rule  { .* a } # ｢a｣

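More generally, I thought the only difference between a token and a rule was that the latter injects a <.ws> at each significant space https://stackoverflow.com/a/48896144/1077672. But that would mean this should work: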

token TOP { <name> <.ws> '@' <.ws> [<subdomain> <.ws> '.']* <.ws>
            <domain> <.ws> '.' <.ws> <tld> <.ws>
}

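But it doesn't!

At first this freaked me out. Writing this footnote two months later, I'm feeling somewhat less freaked out.

Part of this is my speculation about the reason I've not been able to find anyone reporting this in the 15 years since the first Raku grammar prototype became available via Pugs. That speculation includes the possibility that @Larry deliberately designed them to work as they do, and that it being a "bug" is primarily a misunderstanding among the current crop of mere mortals like us, trying to explain why Raku does what it does based on our analysis of our sources -- roast, the original design documents, the compiler source code, etc.

In addition, given that the current "buggy" behaviour seems ideal and intuitive (except for contradicting the doc), I'm focusing on interpreting my feeling of great discomfort -- during this interim period of unknown length in which I don't understand why it gets it right -- as a positive experience. I hope others can too -- or, much better, figure out what is actually going on and let us know!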

