写编译器：学习GNU Flex，写一个词法分析器

以下内容仅为个人学习笔记，非正规教程，难免有疏漏之处，请指出！

目标要分析词法的对象是一种叫TINY+的计算机语言。下面是一个Example，

char str;
int x, fact;
str:= 'sample program in TINY+ language- computes factorial';
read x;
if x>0 and x<100 then {don’t compute if x<=0}
    fact:=1;
    while x>0 do
        fact:=fact*x;
        x:=x-1
    end;
    write fact
end

什么是Flex

“Flex is a tool for generating scanners: programs which recognized lexical patterns in text. flex reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. flex generates as output a C source file, `lex.yy.c’, which defines a routine `yylex()’. This file is compiled and linked with the `-lfl’ library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.” — Vern Paxson

安装Flex

apt-get install flex

入门

flex通过读取一个*.l或者*.lex文件里的单词匹配规则生成C语言的实现代码。
一个*.l的文件里的结构大概如下，用%%分隔开来。

/* Definition */
%%
/* Rules */
%%
/* C Code */

用flex写一个查找读取文本所有整数的程序。
正则表达式里，[0-9]+表示正整数，[+-]?[0-9]+表示整数。定义下面的一个规则：

%%
[+-]?[0-9]+ { printf("%s\n", yytext); } /* Print integers */
\n {} /* newline */
. {} /* For others, do nothing */
%%

void main(){
    yylex();
}

int yywrap(){
    return 1;
}

保存为find.l，使用flex和gcc编译。

# flex find.l
# gcc lex.yy.c

编写一文本作输入样例如下，

test 12345 -1
test kkk lll abc123 555
+111 < 456

运行结果，

root@xiaoxia-pc:~/study/Experiment/lex# ./a.out < test.in 12345 -1 123 555 +111 456

进阶

统计一段C语言代码里符合要求的标识符名称。
一个符号要求的标识符名称，满足以下几点：
1、标识符只能包含数字，字母，下划线
2、不能以数字打头
3、不能是C语言关键字

因为关键字太多，不想写，这里只考虑(unsigned, int, return)

%{
/* 在这里面加入C代码 */
int count = 0;
%}
/* definition */
digit    [0-9]
number   {digit}+
letter   [A-Za-z]
identifier   (_|{letter})({letter}|{digit}|_)*

%%
[ \t]+ {}
\n {} /* 忽略换行 */
{number} {} /* number */
unsigned|int|return {} /* 关键字 */
{number}({letter}|{digit}|_)* {} /* 不能以数字开头 */
{identifier} { ++count; printf("Good identifier: %s\n", yytext); }
. {}

%%
int yywrap(){return 1;}
void main(){
    yylex();
    printf("Total number proper identifier is %d\n", count);
}

测试输入find.l

unsigned int rand_code(unsigned int* rand_seed, unsigned int generator)
{
return *rand_seed = ((*_rand_seed * 234generator + 1)>>16)&65535;
}

运行结果：

root@xiaoxia-pc:~/study/Experiment/lex# ./a.out < test.in Good identifier: rand_code Good identifier: rand_seed Good identifier: generator Good identifier: rand_seed Good identifier: _rand_seed Total number proper identifier is 5 root@xiaoxia-pc:~/study/Experiment/lex#

最后

现在要对下面的代码进行词法分析，

char str;
int x, fact;
str:= 'sample program in TINY+ language- computes factorial';
read x;
if x>0 and x<100 then {don’t compute if x<=0}
    fact:=1;
    while x>0 do
        fact:=fact*x;
        x:=x-1
    end;
    write fact
end

这个是我的编译原理作业。交作业那晚才知道要做，自己从头写一个词法分析器就没那么多时间了，于是找flex帮忙。

任务要求：

1 The input of the scanner is a source code file and the output of the scanner is a stream of tokens.

2 Your scanner should go for longest possible match i.e. a string ‘:=’ is to be identified as ‘ass-symbol’ and not as ‘:’ and ‘=’.

3 Token is represented as (Kind, Value). We use the following symbols to denote different kinds of tokens
KEY denotes reserved words
SYM denotes special symbols
ID denotes identifiers
NUM denotes numeric constants
STR denotes string constants

4 Check lexical errors: giving meaning error messages and the lines where errors occur. The kinds of lexical errors are:
Illegal character, that is, scanner may recognize a character that is not in the alphabet of TINY+, such as $ is an illegal character
The right bracket of a STRING is lost, such as ‘ scanner
The right delimiter of a comment is lost, such as: {this is an example

flex文件：

/*
 *         File: tiny+.l
 *         Author: Xiaoxia([email protected])
 *         Usage:  lex tiny+.l
 *                 gcc lex.yy.c
 *                 ./a.out [testfile]
 */
%option yylineno
%%

if|then|else|end|repeat|until|read|write|or|and|int|bool|char|while|do { 
    printf("Line %d: (KEY, %s)\n", yylineno, yytext); }  /* keyword */
":="|"="|"<"|"+"|"-"|"/"|"("|"*"|")"|";"|"<="|">="|">"|"," { 
    printf("Line %d: (SYM, %s)\n", yylineno, yytext); }              /* symbol */
'[^'\n]*'       { printf("Line %d: (STR, %s)\n", yylineno, yytext); }       /* string */
"{"[^}]*"}"     {}                                                          /* comment */
[0-9]+          { printf("Line %d: (NUM, %s)\n", yylineno, yytext); }       /* number */
[A-Za-z]([A-Za-z]|[0-9])*    { printf("Line %d: (ID, %s)\n", yylineno, yytext); } /* identifier */
[ \t]+          {}                                                  /* whiltespace */
\n              {}                                                  /* newline */
.               { if(*yytext == '{' || *yytext == '\'')
                    printf("Line %d: missing the right part of %s\n", yylineno, yytext);
                  else
                    printf("Line %d: illegal character %s\n", yylineno, yytext);
                  yyterminate();
                }                                                           /* others */
%%

int yywrap() {
    return 1;
}

void main(int argc, char** argv) {
    if(argc > 1)
        yyin = fopen(argv[1], "r");
    else
        yyin = stdin;
    yylex();
}

测试输入如下的Tiny+代码。

{this is an example}
int A,B;
bool C1, C2, C3;
char D;
D:= 'scanner;
while A<=B do
A:=A*2
end

运行结果，

《写编译器：学习GNU Flex，写一个词法分析器》有21个想法

Caboo 2011 年 10 月 24 日 18:12

坐沙发膜拜，求搞基。

回复 ↓

Xiaoxia 文章作者2011 年 10 月 24 日 22:14

恶心死了，keep away from me….

回复 ↓
whitefirer 2011 年 11 月 03 日 00:40

人家是有gf的人了~

回复 ↓
1. 元谷 2011 年 11 月 22 日 18:47
  
  你是怎么知道的？哈哈
  
  回复 ↓

phylixal 2011 年 10 月 24 日 20:48

简洁明了，深受启发

回复 ↓

Xiaoxia 文章作者2011 年 10 月 27 日 15:56

谢谢支持！！！

回复 ↓

Mike 2011 年 10 月 27 日 09:52

GLib里面好像也有词法分析器，

PS, flex就是GNU的 lex

回复 ↓

Xiaoxia 文章作者2011 年 10 月 27 日 15:56

GLib，是gnome lib？还是gtk lib？还是GNU lib？
嗯，安装了flex之后，会设置lex指向flex。

回复 ↓
1. Mike 2011 年 10 月 27 日 21:07
  
  GLib就这个名字, 不过是GNOME维护的. 大量使用于GTK+程序. 几乎是C编程”类”标准库(跨平台!)
  
  回复 ↓

Fan 2011 年 10 月 27 日 18:46

把博主所有的文章大概都看了,技术流啊,不过好像有点文章已经丢失

回复 ↓

Xiaoxia 文章作者2011 年 10 月 27 日 20:08

技术方面比较菜，望多多指点。
非技术的就算了，日常生活流水帐。

“不过好像有点文章已经丢失”什么意思呢？不明白。

回复 ↓

艳文 2011 年 10 月 27 日 22:06

小虾你的email是多少啊？？我想问个php正则表达式的问题呀。

回复 ↓

Xiaoxia 文章作者2011 年 10 月 27 日 23:15

在你收到的博客评论通知的邮件里有哦！！！

回复 ↓

狙击手 2011 年 10 月 29 日 14:58

佩服小虾的学习能力～

回复 ↓

Xiaoxia 文章作者2011 年 10 月 29 日 16:56

一点皮毛而已，没有像你一样深入发展的毅力！

回复 ↓

Jesse 2011 年 11 月 03 日 14:43

对于编译原理的课程很多东西还是手写来得明白有空还是手写个吧
等明白了再去研究lex和yacc的源码或许这样对词法和文法分析更有帮助：-）
不过话说你们的课程上的有点慢啊呵呵
对于内容我就不说什么了我不懂我看编译原理书时都跳过lex和yacc

回复 ↓

Xiaoxia 文章作者2011 年 11 月 03 日 16:24

谢谢提议:-P
很早之前已经写过一个啦，http://xiaoxia.org/?s=fakebasic
不想重复造轮子咯，研究了lex的源代码，用查表的方式比起教学里用switch的方式处理自动机，效率差太远了。前辈们都为我们打足了基础啊！

回复 ↓

匿名 2012 年 02 月 17 日 23:03

博主记住:术业要有专攻.好样的继续加油吧

回复 ↓

Xiaoxia 文章作者2012 年 02 月 18 日 02:37

哈，你把我的陈年旧文都挖出来了。

回复 ↓

kang 2014 年 12 月 18 日 13:02

你好. 最后一个例子中, keyword那样定义, 如果待分析文本里出现如iff ifz 这样的是否也会被匹配上呢?
identifier 那样定义, 想if ,while这样的关键字不也成为了标志符了吗?

回复 ↓

Xiaoxia 文章作者2015 年 04 月 18 日 23:34

好想回答你这个问题，不过我老了。。。

回复 ↓

发表回复取消回复

Caboo 2011 年 10 月 24 日 18:12

坐沙发膜拜，求搞基。

回复 ↓
1. Xiaoxia 文章作者2011 年 10 月 24 日 22:14
  
  恶心死了，keep away from me….
  
  回复 ↓
2. whitefirer 2011 年 11 月 03 日 00:40
  
  人家是有gf的人了~
  
  回复 ↓
  1. 元谷 2011 年 11 月 22 日 18:47
    
    你是怎么知道的？哈哈
    
    回复 ↓
phylixal 2011 年 10 月 24 日 20:48

简洁明了，深受启发

回复 ↓
1. Xiaoxia 文章作者2011 年 10 月 27 日 15:56
  
  谢谢支持！！！
  
  回复 ↓
Mike 2011 年 10 月 27 日 09:52

GLib里面好像也有词法分析器，

PS, flex就是GNU的 lex

回复 ↓
1. Xiaoxia 文章作者2011 年 10 月 27 日 15:56
  
  GLib，是gnome lib？还是gtk lib？还是GNU lib？
  嗯，安装了flex之后，会设置lex指向flex。
  
  回复 ↓
  1. Mike 2011 年 10 月 27 日 21:07
    
    GLib就这个名字, 不过是GNOME维护的. 大量使用于GTK+程序. 几乎是C编程”类”标准库(跨平台!)
    
    回复 ↓
Fan 2011 年 10 月 27 日 18:46

把博主所有的文章大概都看了,技术流啊,不过好像有点文章已经丢失

回复 ↓
1. Xiaoxia 文章作者2011 年 10 月 27 日 20:08
  
  技术方面比较菜，望多多指点。
  非技术的就算了，日常生活流水帐。
  
  “不过好像有点文章已经丢失”什么意思呢？不明白。
  
  回复 ↓
艳文 2011 年 10 月 27 日 22:06

小虾你的email是多少啊？？我想问个php正则表达式的问题呀。

回复 ↓
1. Xiaoxia 文章作者2011 年 10 月 27 日 23:15
  
  在你收到的博客评论通知的邮件里有哦！！！
  
  回复 ↓
狙击手 2011 年 10 月 29 日 14:58

佩服小虾的学习能力～

回复 ↓
1. Xiaoxia 文章作者2011 年 10 月 29 日 16:56
  
  一点皮毛而已，没有像你一样深入发展的毅力！
  
  回复 ↓
Jesse 2011 年 11 月 03 日 14:43

对于编译原理的课程很多东西还是手写来得明白有空还是手写个吧
等明白了再去研究lex和yacc的源码或许这样对词法和文法分析更有帮助：-）
不过话说你们的课程上的有点慢啊呵呵
对于内容我就不说什么了我不懂我看编译原理书时都跳过lex和yacc

回复 ↓
1. Xiaoxia 文章作者2011 年 11 月 03 日 16:24
  
  谢谢提议:-P
  很早之前已经写过一个啦，http://xiaoxia.org/?s=fakebasic
  不想重复造轮子咯，研究了lex的源代码，用查表的方式比起教学里用switch的方式处理自动机，效率差太远了。前辈们都为我们打足了基础啊！
  
  回复 ↓
匿名 2012 年 02 月 17 日 23:03

博主记住:术业要有专攻.好样的继续加油吧

回复 ↓
1. Xiaoxia 文章作者2012 年 02 月 18 日 02:37
  
  哈，你把我的陈年旧文都挖出来了。
  
  回复 ↓
kang 2014 年 12 月 18 日 13:02

你好. 最后一个例子中, keyword那样定义, 如果待分析文本里出现如iff ifz 这样的是否也会被匹配上呢?
identifier 那样定义, 想if ,while这样的关键字不也成为了标志符了吗?

回复 ↓
1. Xiaoxia 文章作者2015 年 04 月 18 日 23:34
  
  好想回答你这个问题，不过我老了。。。
  
  回复 ↓

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Xiaoxia[PG]

Yesterday is history, tomorrow is mistery, today is a gift!