Building a Compiler, Part 1 of 4: Getting Started with Lex
“Bless thee, Bottom! Bless thee! Thou art translated.
I see their knavery: this is to make an ass of me…”
— A Midsummer Night’s Dream, Act 3, Scene 1
A compiler is a program that translates source code into object code (assembly or machine code). This object program can then be loaded into memory and executed by the CPU. Other kinds of translators include interpreters, assemblers, and preprocessors, which are worth looking into on their own.
The goal of this series of posts is to develop a compiler for a Turing-complete language named 'TURQ'. This first part focuses on the first phase of compilation, tokenization, and introduces LEX/FLEX, the tool we will use to construct the scanner for our language. The second part will deal with the grammar for a language, and the third with what a Turing machine is and what types of Turing machines exist. Finally, in Part 4, we will consolidate our learnings and program the compiler for 'TURQ', which should be enough to get you started.
Tokenization
The first phase of a compiler converts the source code, viewed as a sequence of characters, into a sequence of tokens. A token can be an identifier, a keyword, an operator, or any special symbol reserved by the language. This conversion is done by the lexical analyzer.
When the compilation of a program begins, the compiler calls the syntax analyzer (the parser; more on that and on grammars in Part 2). The syntax analyzer gets the next token from the stream by a call to the lexical analyzer. The information about the token is a pair of numbers <token_id, value>, where token_id is an integer identifying the type of token and value is a pointer to an entry in the symbol table, which holds information specific to that token.
| Interaction between the compiler components |
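To make the pair concrete, here is a minimal C sketch of these structures (the type and field names are illustrative, not taken from any particular compiler):

/* Integer ids identifying the token types; the values are arbitrary. */
enum token_id { TOK_ID = 1, TOK_NUMBER, TOK_PLUS, TOK_ASSIGN };

/* An entry in the symbol table, e.g. for an identifier. */
struct symtab_entry {
    char name[64];  /* the lexeme as written in the source       */
    int  type;      /* filled in by later phases of the compiler */
};

/* The <token_id, value> pair the lexical analyzer hands back. */
struct token {
    enum token_id        id;     /* which kind of token this is           */
    struct symtab_entry *value;  /* symbol table entry; NULL for e.g. '+' */
};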
Now, how do we generate these tokens from the character stream? We have to scan the entire source code and map each pattern to its corresponding token. Doing this manually is painful, and it was one of the reasons compilers were difficult to write. With utilities like LEX/FLEX, we can instead specify each pattern as a regular expression and associate a corresponding action with it.
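For example, each rule in a lex specification pairs a regex pattern with a C action. The fragment below (two made-up rules, not from TURQ) simply prints a message for each match, using yytext, the buffer in which lex exposes the matched text:

[0-9]+     { printf("number: %s\n", yytext); }
"while"    { printf("keyword: while\n"); }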
LEX
Lex is a tool that creates a scanner from a specification file listing patterns and the action associated with each of them. Let's look at the general format of a lex file:
definitions
%%
patterns/actions
%%
subroutines
The definitions section specifies any substitutions that will be used in the patterns, along with C code enclosed between %{ and %} for initializing variables, including header files, defining macros, and so on. It is also the place where start conditions are declared.
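For instance, a definitions section might look like the sketch below (the names DIGIT, LETTER, token_count, and the COMMENT start condition are all made up for this example):

%{
/* Code between %{ and %} is copied verbatim into the generated scanner. */
#include <stdio.h>
int token_count = 0;   /* e.g. a counter the actions can update */
%}

DIGIT    [0-9]
LETTER   [a-zA-Z]

%x COMMENT

The substitutions can then be referenced in the rules section as {DIGIT} and {LETTER}; start conditions such as COMMENT are covered near the end of this post.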
The patterns, as mentioned earlier, are specified using regex notation. Some common patterns include:
- a  the literal character a
- *  a metacharacter matching 0 or more occurrences of the preceding expression
- aa*  any string of only a's, of length at least 1
- .  a metacharacter matching any character except \n
- ?  a metacharacter matching 0 or 1 occurrences of the preceding expression
- a(bc)?  matches a and abc
- []  a character class
- [a-z]  matches a single lowercase letter, a to z
- [^a-z]  matches any single character except a lowercase letter
This should be enough (for this blog, at least). Also note that if you do not specify an action for a pattern, the default action is ECHO, which prints the matched text to the output stream.
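To see a few of these patterns together, here is a small scanner (hypothetical rules, invented for this example); anything the two rules do not match falls through to the default ECHO:

%%
a(bc)?     { printf("<a-or-abc>"); }
[a-z]+     { printf("<word:%s>", yytext); }
%%

int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }

On the input abc hi 42 it prints <a-or-abc> <word:hi> 42: abc matches both rules with equal length, so the earlier rule wins, while the spaces and digits match neither rule and are echoed unchanged.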
The subroutines section contains ordinary C functions that can be used in the actions.
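For example, an action can call a helper defined in the subroutines section (the count_word function below is hypothetical):

%{
#include <stdio.h>
void count_word(const char *w);  /* defined in the subroutines section */
%}

%%
[a-zA-Z]+   { count_word(yytext); }
.|\n        { /* discard everything else */ }
%%

/* Subroutines section: ordinary C appended to lex.yy.c. */
int words = 0;

void count_word(const char *w) {
    printf("word %d: %s\n", ++words, w);
}

int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }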
Identifying identifiers
As a sample program, let's write a scanner that recognizes identifiers. An identifier is any string of printable characters that begins with a letter and is followed by any combination of letters and digits.
| LEX program to recognize identifiers |
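The exact program is linked below; here is a minimal sketch consistent with it, assuming the two-rule layout that the note after the build commands refers to (identifiers first, a catch-all second):

%{
#include <stdio.h>
%}

LETTER   [a-zA-Z]
DIGIT    [0-9]

%%
{LETTER}({LETTER}|{DIGIT})*   { printf("identifier: %s\n", yytext); }
.|\n                          { /* rule 2: discard anything else */ }
%%

int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }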
To run the program, create a file named identifiers.l and run it using
> lex identifiers.l
This creates a file named lex.yy.c.
Compile and run it using the commands:
> gcc lex.yy.c
> ./a.out
Note the order of the rules: lex picks the longest match and, when the lengths tie, the earlier rule. So if the second rule were moved to the top, every single-character valid identifier would be consumed by the catch-all and ignored.
Find the code at: identifier.l
Discarding comments
Another role of the lexical analyzer is to discard any comments or whitespace. To achieve this, we can use start conditions, which change the set of rules that are active, i.e. up for consideration.
Start conditions are declared with %x (exclusive: rules with no start condition are not active) or %s (inclusive: rules with no start condition remain active). To use one, prefix the rule with the condition's name in angle brackets.
The first rule scans for the start of a comment. Once we are inside the comment, we must watch for the terminating */, or a run of *'s followed by the terminating /; this is handled by the fourth rule. We also need to ignore every other character inside the comment, taking care not to consume the * that begins the terminator; hence the third rule.
| LEX program to discard comments |
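The exact program is linked below; here is a minimal sketch along the same lines. The second rule, which the walkthrough above does not describe, is assumed here to eat runs of *'s that do not end the comment:

%{
#include <stdio.h>
%}

%x COMMENT

%%
"/*"                 { BEGIN(COMMENT);  /* rule 1: a comment opens */ }
<COMMENT>"*"+[^*/]*  { /* rule 2: *'s not followed by a '/' */ }
<COMMENT>[^*]+       { /* rule 3: eat text without consuming a '*' */ }
<COMMENT>"*"+"/"     { BEGIN(INITIAL);  /* rule 4: *'s ending in '/' */ }
%%

int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }

Text outside comments matches no rule and is ECHOed unchanged, so the output is the input with comments stripped; an unterminated comment is silently swallowed at end of input.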
To run the program, create a file named comments.l and run it using
> lex comments.l
This creates a file named lex.yy.c.
Compile and run it using the commands:
> gcc lex.yy.c
> ./a.out
Find the code at: comments.l