How to make a Boost.Spirit.Lex token value a substring of the matched sequence (preferably via a regex capture group)

2023-12-02

I am writing a simple expression parser. It is built on a Boost.Spirit.Qi grammar that works on Boost.Spirit.Lex tokens (Boost version 1.56).

The tokens are defined as follows:

using namespace boost::spirit;

template<
    typename lexer_t
>
struct tokens
    : lex::lexer<lexer_t>
{
    tokens()
        : /* ... */,
          variable("%(\\w+)")
    {
        this->self =
            /* ... */ |
            variable;
    }

    /* ... */
    lex::token_def<std::string> variable;
};

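Now I would like the value of the variable token to be just the name (the capture group (\\w+)) without the % prefix. How do I do that?

Using a capture group by itself does not help. The value is still the full string, including the % prefix.

Is there any way to force the use of the capture group?

Or at least to refer to it somehow in the token's action?

I also tried using an action like this: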

variable[lex::_val = std::string(lex::_start + 1, lex::_end)]

but it failed to compile. The error claimed that none of the std::string constructor overloads could match the arguments:

(const boost::phoenix::actor<Expr>, const boost::spirit::lex::_end_type)

Even the simpler

variable[lex::_val = std::string(lex::_start, lex::_end)]

failed to compile, for a similar reason, except that now the first argument was of type boost::spirit::lex::_start_type.

Finally I tried this (even though it looks quite wasteful):

lex::_val = std::string(lex::_val).erase(0, 1)

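But that failed to compile as well. This time the compiler could not convert from const boost::spirit::lex::_val_type to std::string.

Is there any way to deal with this problem?

Simple solution

The correct way to construct the std::string attribute value is as follows: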

variable[lex::_val = boost::phoenix::construct<std::string>(lex::_start + 1, lex::_end)]

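exactly as suggested by jv_ in his (or her) comment. The expression has to be lazy: lex::_start and lex::_end are Phoenix placeholders that only receive their values when the action runs, so an ordinary std::string constructor call cannot accept them.

boost::phoenix::construct is provided by the <boost/phoenix/object/construct.hpp> header. Alternatively, include <boost/phoenix.hpp>.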

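Regex solution

The solution above, however, works only in simple cases. It also rules out supplying the pattern from outside (from configuration data in particular), because changing the pattern to, for example, %(\\w+)% would require changing the value construction code.

That is why it would be much better to be able to refer to capture groups from the regular expression that defines the token.

Note that this is still not perfect, since odd cases like %(\\w+)%(\\w+)% would still require code changes to be handled correctly. That could be worked around by configuring not only the regex of the token but also a means of forming the value from the matched range. But that is beyond the scope of the question. Using capture groups directly seems flexible enough for many cases.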
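The idea of configuring "a means of forming the value" next to the regex can be sketched outside of Spirit. The following is only an illustration under stated assumptions: form_value is a hypothetical helper, std::regex stands in for the regex engine, and the value format uses the ECMAScript-style $1, $2 placeholders understood by std::match_results::format. Nothing here is part of Boost.Spirit.Lex.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not a Boost.Spirit API). Both `pattern` and
// `value_format` are assumed to come from configuration data.
// The token text is matched against the pattern as a whole, and the value is
// built by substituting the captures into the format string ($1, $2, ...).
inline std::string form_value(std::string const& token_text,
                              std::string const& pattern,
                              std::string const& value_format)
{
    std::regex const re(pattern);
    std::smatch results;
    bool const matched = std::regex_match(token_text, results, re);
    // In a lexer action the text has already matched the token pattern,
    // so a failure here would indicate a pattern or dialect mismatch.
    assert(matched && "token text must match its own pattern");
    if (!matched)
        return std::string();
    return results.format(value_format);
}
```

With the pattern %(\\w+)%(\\w+)% and the format $1.$2, the text %a%b% would yield a.b, and switching to another pattern would only require changing the two configured strings.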

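sehe stated in a comment elsewhere that there is no way to use capture groups from the token's regular expression. Not to mention that tokens actually support only a subset of regular expression syntax. (A notable difference is, for example, the lack of support for naming capture groups or for ignoring them!)

My own experiments in this area support that as well. Sadly, there is no way to use the capture groups. There is a workaround, however: you have to re-apply the regular expression in the action.

Action to get the capture range

To keep things somewhat modular, let's start with the simplest task: an action that returns a boost::iterator_range covering the part of the token's match that corresponds to the specified capture.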

template<typename Attribute, typename Char, typename Idtype>
class basic_get_capture
{
public:
    typedef lex::token_def<Attribute, Char, Idtype> token_type;
    typedef boost::basic_regex<Char> regex_type;

    explicit basic_get_capture(token_type const& token, int capture_index = 1)
        : token(token),
          regex(),
          capture_index(capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    boost::iterator_range<Iterator> operator ()(Iterator& first, Iterator& last, lex::pass_flags& /*flag*/, IdType& /*id*/, Context& /*context*/)
    {
        typedef boost::match_results<Iterator> match_results_type;

        match_results_type results;
        regex_match(first, last, results, get_regex());
        typename match_results_type::const_reference capture = results[capture_index];
        return boost::iterator_range<Iterator>(capture.first, capture.second);
    }

private:
    regex_type& get_regex()
    {
        if(regex.empty())
        {
            // string_type is a dependent name here, so `typename` is required
            typename token_type::string_type const& regex_text = token.definition();
            regex.assign(regex_text);
        }
        return regex;
    }

    token_type const& token;
    regex_type regex;
    int capture_index;
};

template<typename Attribute, typename Char, typename Idtype>
basic_get_capture<Attribute, Char, Idtype> get_capture(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_get_capture<Attribute, Char, Idtype>(token, capture_index);
}

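The action uses Boost.Regex (include <boost/regex.hpp>). It re-applies the token's own pattern to the already matched range [first, last) and returns the sub-range of the requested capture.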
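The core step of that action, stripped of the Spirit.Lex machinery, can be tried in isolation. The sketch below is an assumption-laden stand-in rather than the action itself: it uses std::regex instead of Boost.Regex so that it has no external dependencies, and capture_range and capture_string are hypothetical names. Keep in mind that the lexer's regex dialect and the one used here are not guaranteed to agree on every pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: re-applies `pattern` to the already matched range
// [first, last) and returns the sub-range of capture `index`.
// Returns the empty range (last, last) if the pattern does not match the
// whole range or the capture did not take part in the match.
template <typename Iterator>
std::pair<Iterator, Iterator> capture_range(Iterator first, Iterator last,
                                            std::string const& pattern,
                                            std::size_t index = 1)
{
    std::regex const re(pattern);
    assert(index <= re.mark_count());  // the pattern must define that capture
    std::match_results<Iterator> results;
    if (!std::regex_match(first, last, results, re) || !results[index].matched)
        return std::make_pair(last, last);
    return std::make_pair(results[index].first, results[index].second);
}

// Convenience wrapper that copies the capture into a new string.
inline std::string capture_string(std::string const& token_text,
                                  std::string const& pattern,
                                  std::size_t index = 1)
{
    std::pair<std::string::const_iterator, std::string::const_iterator> const range =
        capture_range(token_text.begin(), token_text.end(), pattern, index);
    return std::string(range.first, range.second);
}
```

Unlike an offset-based expression such as _start + 1, this keeps working unchanged when the pattern is switched from %(\\w+) to %(\\w+)%, which is the motivation for the capture-based actions.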

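Action to get the capture as a string

A capture range is nice to have because it does not allocate any new memory for a string, but what we ultimately want is a string. So here is another action, built on top of the previous one.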

template<typename Attribute, typename Char, typename Idtype>
class basic_get_capture_as_string
{
public:
    typedef basic_get_capture<Attribute, Char, Idtype> basic_get_capture_type;
    typedef typename basic_get_capture_type::token_type token_type;

    explicit basic_get_capture_as_string(token_type const& token, int capture_index = 1)
        : get_capture_functor(token, capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    std::basic_string<Char> operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
    {
        boost::iterator_range<Iterator> const& capture = get_capture_functor(first, last, flag, id, context);
        return std::basic_string<Char>(capture.begin(), capture.end());
    }

private:
    basic_get_capture_type get_capture_functor;
};

template<typename Attribute, typename Char, typename Idtype>
basic_get_capture_as_string<Attribute, Char, Idtype> get_capture_as_string(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_get_capture_as_string<Attribute, Char, Idtype>(token, capture_index);
}

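No magic here. We just make a std::basic_string from the range returned by the simpler action.

Action to assign the value from the capture

Actions that merely return a value are of little use to us. The ultimate goal is to set the token value from the capture. That is done by the last action.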

template<typename Attribute, typename Char, typename Idtype>
class basic_set_val_from_capture
{
public:
    typedef basic_get_capture_as_string<Attribute, Char, Idtype> basic_get_capture_as_string_type;
    typedef typename basic_get_capture_as_string_type::token_type token_type;

    explicit basic_set_val_from_capture(token_type const& token, int capture_index = 1)
        : get_capture_as_string_functor(token, capture_index)
    {
    }

    template<typename Iterator, typename IdType, typename Context>
    void operator ()(Iterator& first, Iterator& last, lex::pass_flags& flag, IdType& id, Context& context)
    {
        std::basic_string<Char> const& capture = get_capture_as_string_functor(first, last, flag, id, context);
        context.set_value(capture);
    }

private:
    basic_get_capture_as_string_type get_capture_as_string_functor;
};

template<typename Attribute, typename Char, typename Idtype>
basic_set_val_from_capture<Attribute, Char, Idtype> set_val_from_capture(lex::token_def<Attribute, Char, Idtype> const& token, int capture_index = 1)
{
    return basic_set_val_from_capture<Attribute, Char, Idtype>(token, capture_index);
}

Discussion

The actions are used like this:

variable[set_val_from_capture(variable)]

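Optionally, you can provide a second argument: the index of the capture to use. It defaults to 1, which seems appropriate in most cases.

Creation functions

set_val_from_capture (or get_capture_as_string or get_capture, respectively) is a helper function used to deduce the template arguments from the token_def automatically. What we need in particular is the Char type, in order to build the corresponding regex.

I am not sure whether this could reasonably be avoided, and even if it could, it would significantly complicate the call operator (especially if we strive to cache the regex object rather than rebuild it on every call). My doubts come mostly from not being sure whether the Char type of token_def is required to be the same as the character type of the tokenized sequence. I assume they do not have to be the same.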
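To make the role of the Char parameter and of the cached regex more concrete, here is a minimal sketch that is independent of Spirit. cached_capture is a hypothetical class, std::basic_regex replaces boost::basic_regex, and the sketch simply assumes that the iterator's value type equals Char, which is exactly the point the paragraph above is unsure about.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical stand-in for the caching done by basic_get_capture.
// Assumption: the value type of Iterator is the same as Char.
template <typename Char>
class cached_capture
{
public:
    explicit cached_capture(std::basic_string<Char> const& pattern,
                            std::size_t index = 1)
        : pattern_(pattern), index_(index), compiled_(false), compile_count_(0)
    {
    }

    // Returns capture `index` of the match over [first, last),
    // or an empty string if the pattern does not match the whole range.
    template <typename Iterator>
    std::basic_string<Char> operator()(Iterator first, Iterator last)
    {
        std::match_results<Iterator> results;
        if (!std::regex_match(first, last, results, regex()))
            return std::basic_string<Char>();
        assert(index_ < results.size());  // the pattern must define that capture
        return results[index_].str();
    }

    // Number of times the pattern was compiled (for illustration only).
    int compile_count() const { return compile_count_; }

private:
    // Compiles the pattern on first use and reuses the regex object afterwards.
    std::basic_regex<Char> const& regex()
    {
        if (!compiled_)
        {
            regex_.assign(pattern_);
            compiled_ = true;
            ++compile_count_;
        }
        return regex_;
    }

    std::basic_string<Char> pattern_;
    std::size_t index_;
    bool compiled_;
    int compile_count_;
    std::basic_regex<Char> regex_;
};
```

Because the regex type is fixed by Char at construction time, the same object can serve both char and wchar_t input only through separate instantiations, which is why the creation functions deduce Char from the token definition.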

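Repeating the token

The decidedly unpleasant part of the action is the need to pass the token itself as an argument, which repeats it.

However, the token is needed both for the Char type, as mentioned above, and to get the regular expression!

It seems to me that, at least in theory, we could obtain the token "at run time" somehow, based on the id argument of the action (which we currently just ignore). However, I did not find any way to obtain a token_def from the token's identifier, whether from the context argument or from the lexer itself (which could be passed to the action as this through the creation function).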

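Reusability

Since these are actions, they are not really reusable (out of the box) in more complex scenarios. For example, if you wanted not only to get the capture but also to convert it to some numeric value, you would have to write yet another action in the same way, instead of composing a more complex expression at the token.

At first I tried to achieve something like this: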

variable[lex::_val = get_capture_as_string(variable)]

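It looks much more flexible, as you could easily add more code around it, for example by wrapping it in some conversion function.

But I failed to achieve it, although I feel I just did not try hard enough. Learning more about Boost.Phoenix would surely help a lot here.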
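The kind of composition aimed at here, extraction followed by an arbitrary conversion, can at least be illustrated in plain C++. The sketch below does not solve the Phoenix integration and is not a Spirit action: capture_as and to_int are hypothetical names, std::regex stands in for Boost.Regex, and the #(\\d+) pattern is just an example.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: extracts capture `index` from `token_text` and passes
// it through `convert`, so extraction and conversion stay separate pieces.
template <typename Convert>
auto capture_as(std::string const& token_text,
                std::string const& pattern,
                Convert convert,
                std::size_t index = 1) -> decltype(convert(std::string()))
{
    std::regex const re(pattern);
    std::smatch results;
    bool const matched = std::regex_match(token_text, results, re);
    // The token text is expected to have matched the same pattern already.
    assert(matched && index < results.size());
    return convert(matched ? results[index].str() : std::string());
}

// Example conversion: decimal text to int.
inline int to_int(std::string const& text)
{
    return std::stoi(text);
}

// Example conversion: keep the capture as a string.
inline std::string as_is(std::string const& text)
{
    return text;
}
```

A call such as capture_as("#42", "#(\\d+)", to_int) would produce the integer 42, while swapping in as_is keeps the text; a Phoenix-based version would need the same split, only expressed lazily.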

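Double work

None of these workarounds saves us from doing double work: the regular expression is parsed twice and then matched twice. But as mentioned at the beginning, there seems to be no better way (short of changing Boost.Spirit itself).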
