From the Dragon Book, 2nd edition, Section 3.5.3, "Conflict Resolution in Lex":
We have alluded to the two rules that Lex uses to decide on the proper lexeme
to select, when several prefixes of the input match one or more patterns:
1. Always prefer a longer prefix to a shorter prefix.
2. If the longest possible prefix matches two or more patterns, prefer the
pattern listed first in the Lex program.
The rules above also apply to Flex. Here is what the Flex manual says (Chapter 7: How the Input Is Matched):
When the generated scanner is run, it analyzes its input looking for strings
which match any of its patterns. If it finds more than one match, it takes the
one matching the most text (for trailing context rules, this includes the length
of the trailing part, even though it will then be returned to the input). If it
finds two or more matches of the same length, the rule listed first in the flex
input file is chosen.
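If I understand correctly, your lexer treats keywords such as Endif as identifiers, so they are later treated as part of an expression. If that is your problem, just put the keyword rules at the top of your specification, like this (assuming each uppercase word is a predefined enum value corresponding to a token):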
"If" { return IF; }
"Then" { return THEN; }
"Endif" { return ENDIF; }
"While" { return WHILE; }
"Do" { return DO; }
"EndWhile" { return ENDWHILE; }
\"(\\.|[^\\"])*\" { return STRING; }
[a-zA-Z_][a-zA-Z0-9_]* { return IDENTIFIER; }
The keywords will then always be matched ahead of identifiers because of rule 2.
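To make the interaction of the two rules concrete, here is a minimal C++ sketch that imitates the selection logic with std::regex. It is only an illustration, not how flex works internally (flex compiles all patterns into a single DFA). The names Rule, pick_rule, keyword_first_rules and self_check are hypothetical, and the sketch assumes patterns for which a greedy regex match is also the longest match, which holds for the simple patterns used here.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for one flex rule: a token name plus its pattern.
struct Rule {
    std::string name;
    std::regex pattern;
};

// Rules in the same order as the specification above: keywords first,
// the identifier rule last.
std::vector<Rule> keyword_first_rules() {
    return {
        {"IF", std::regex("If")},
        {"THEN", std::regex("Then")},
        {"ENDIF", std::regex("Endif")},
        {"WHILE", std::regex("While")},
        {"DO", std::regex("Do")},
        {"ENDWHILE", std::regex("EndWhile")},
        {"IDENTIFIER", std::regex("[a-zA-Z_][a-zA-Z0-9_]*")},
    };
}

// Picks the winning rule for the text at the start of `input` and stores the
// matched text in `lexeme`. Returns "NONE" if no rule matches.
std::string pick_rule(const std::vector<Rule>& rules,
                      const std::string& input,
                      std::string& lexeme) {
    std::string best_name = "NONE";
    std::size_t best_len = 0;
    for (const Rule& rule : rules) {
        std::smatch m;
        // match_continuous forces the match to begin at the first character.
        if (std::regex_search(input, m, rule.pattern,
                              std::regex_constants::match_continuous)) {
            const std::size_t len = static_cast<std::size_t>(m.length(0));
            // Strictly longer wins (rule 1). On a tie the earlier rule is
            // kept because it was seen first (rule 2).
            if (len > best_len) {
                best_len = len;
                best_name = rule.name;
            }
        }
    }
    lexeme = input.substr(0, best_len);
    return best_name;
}

// Small self-check of the behaviour described in this answer.
void self_check() {
    std::string lexeme;
    // Tie between "If" and the identifier pattern: the keyword is listed first.
    assert(pick_rule(keyword_first_rules(), "If", lexeme) == "IF");
    // The identifier pattern matches more text, so it wins.
    assert(pick_rule(keyword_first_rules(), "If_x", lexeme) == "IDENTIFIER");
}
```

Calling pick_rule(keyword_first_rules(), "Endif", lexeme) should select ENDIF here, because both the keyword pattern and the identifier pattern match five characters and the keyword is listed first. If the identifier rule were moved to the top of the list, the same call would select IDENTIFIER, which is the behaviour the rule ordering above is meant to avoid.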
EDIT:
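Thanks for your comment, kol. I forgot to add the string rule. But I don't think my solution is wrong. For example, if an identifier is called If_this_is_an_identifier, rule 1 applies, so the identifier rule takes effect (because it matches the longest string). I wrote a simple test case and found no problem with my solution. Here is my lex.l file: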
%{
#include <iostream>
using namespace std;
%}
ID [a-zA-Z_][a-zA-Z0-9_]*
%option noyywrap
%%
"If" { cout << "IF: " << yytext << endl; }
"Then" { cout << "THEN: " << yytext << endl; }
"Endif" { cout << "ENDIF: " << yytext << endl; }
"While" { cout << "WHILE: " << yytext << endl; }
"Do" { cout << "DO: " << yytext << endl; }
"EndWhile" { cout << "ENDWHILE: " << yytext << endl; }
\"(\\.|[^\\"])*\" { cout << "STRING: " << yytext << endl; }
{ID} { cout << "IDENTIFIER: " << yytext << endl; }
. { cout << "Ignore token: " << yytext << endl; }
%%
int main(int argc, char* argv[]) {
    ++argv, --argc; /* skip over program name */
    if ( argc > 0 )
        yyin = fopen( argv[0], "r" );
    else
        yyin = stdin;
    yylex();
}
I tested my solution with the following test case:
If If_this_is_an_identifier > 0 Then read(b); Endif
c := "If I were...";
While While_this_is_also_an_identifier > 5 Do d := d + 1 Endwhile
It gives me the following output (other output not related to the problem you mentioned is omitted):
IF: If
IDENTIFIER: If_this_is_an_identifier
......
STRING: "If I were..."
......
WHILE: While
IDENTIFIER: While_this_is_also_an_identifier
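The lex.l program is adapted from an example in the flex manual, http://westes.github.io/flex/manual/Simple-Examples.html (which uses the same method to tell keywords apart from identifiers).
Also take a look at the ANSI C grammar, Lex specification: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html.
I also use this approach in my personal project, and so far I haven't found any problems.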