YACC can parse input streams consisting of tokens with certain values. This clearly describes the relation YACC has with Lex, YACC has no idea what 'input streams' are, it needs preprocessed tokens. While you can write your own Tokenizer, we will leave that entirely up to Lex.
A note on grammars and parsers. When YACC saw the light of day, the tool was used to parse input files for compilers: programs. Programs written in a programming language for computers are typically *not* ambiguous - they have just one meaning. As such, YACC does not cope with ambiguity and will complain about shift/reduce or reduce/reduce conflicts. More about ambiguity and YACC "problems" can be found in 'Conflicts' chapter.
Let's say we have a thermostat that we want to control using a simple language. A session with the thermostat may look like this:
heat on
Heater on!
heat off
Heater off!
target temperature 22
New temperature set!
The tokens we need to recognize are: heat, on/off (STATE), target, temperature, NUMBER.
The Lex tokenizer (Example 4) is:
%{
#include <stdio.h>
#include "y.tab.h"
%}
%%
[0-9]+ return NUMBER;
heat return TOKHEAT;
on|off return STATE;
target return TOKTARGET;
temperature return TOKTEMPERATURE;
\n /* ignore end of line */;
[ \t]+ /* ignore whitespace */;
%%
We note two important changes. First, we include the file 'y.tab.h', and secondly, we no longer print stuff, we return names of tokens. This change is because we are now feeding it all to YACC, which isn't interested in what we output to the screen. Y.tab.h has definitions for these tokens.
But where does y.tab.h come from? It is generated by YACC from the Grammar File we are about to create. As our language is very basic, so is the grammar:
commands: /* empty */
| commands command
;
command:
heat_switch
|
target_set
;
heat_switch:
TOKHEAT STATE
{
printf("\tHeat turned on or off\n");
}
;
target_set:
TOKTARGET TOKTEMPERATURE NUMBER
{
printf("\tTemperature set\n");
}
;
The first part is what I call the 'root'. It tells us that we have 'commands', and that these commands consist of individual 'command' parts. As you can see this rule is very recursive, because it again contains the word 'commands'. What this means is that the program is now capable of reducing a series of commands one by one. Read the chapter 'How do Lex and YACC work internally' for important details on recursion.
The second rule defines what a command is. We support only two kinds of commands, the 'heat_switch' and the 'target_set'. This is what the |-symbol signifies - 'a command consists of either a heat_switch or a target_set'.
A heat_switch consists of the HEAT token, which is simply the word 'heat', followed by a state (which we defined in the Lex file as 'on' or 'off').
Somewhat more complicated is the target_set, which consists of the TARGET token (the word 'target'), the TEMPERATURE token (the word 'temperature') and a number.
The previous section only showed the grammar part of the YACC file, but there is more. This is the header that we omitted:
%{
#include <stdio.h>
#include <string.h>
void yyerror(const char *str)
{
fprintf(stderr,"error: %s\n",str);
}
int yywrap()
{
return 1;
}
main()
{
yyparse();
}
%}
%token NUMBER TOKHEAT STATE TOKTARGET TOKTEMPERATURE
The yyerror() function is called by YACC if it finds an error. We simply
output the message passed, but there are smarter things to do. See
the 'Further reading' section at the end.
The function yywrap() can be used to continue reading from another file. It is called at EOF and you can than open another file, and return 0. Or you can return 1, indicating that this is truly the end. For more about this, see the 'How do Lex and YACC work internally' chapter.
Then there is the main() function, that does nothing but set everything in motion.
The last line simply defines the tokens we will be using. These are output using y.tab.h if YACC is invoked with the '-d' option.
lex example4.l
yacc -d example4.y
cc lex.yy.c y.tab.c -o example4
A few things have changed. We now also invoke YACC to compile our grammar,
which creates y.tab.c and y.tab.h. We then call Lex as usual. When
compiling, we remove the -ll flag: we now have our own main() function and
don't need the one provided by libl.
NOTE: if you get an error about your compiler not being able to
find 'yylval', add this to example4.l, just beneath #include
<y.tab.h>:
extern YYSTYPE yylval;
This is explained in the 'How Lex and YACC work internally' section.
A sample session:
$ ./example4
heat on
Heat turned on or off
heat off
Heat turned on or off
target temperature 10
Temperature set
target humidity 20
error: parse error
$
This is not quite what we set out to achieve, but in the interest of keeping the learning curve manageable, not all cool stuff can be presented at once.
As we've seen, we now parse the thermostat commands correctly, and even flag mistakes properly. But as you might have guessed by the weasely wording, the program has no idea of what it should do, it does not get passed any of the values you enter.
Let's start by adding the ability to read the new target temperature. In order to do so, we need to learn the NUMBER match in the Lexer to convert itself into an integer value, which can then be read in YACC.
Whenever Lex matches a target, it puts the text of the match in the character string 'yytext'. YACC in turn expects to find a value in the variable 'yylval'. In Example 5, we see the obvious solution:
%{
#include <stdio.h>
#include "y.tab.h"
%}
%%
[0-9]+ yylval=atoi(yytext); return NUMBER;
heat return TOKHEAT;
on|off yylval=!strcmp(yytext,"on"); return STATE;
target return TOKTARGET;
temperature return TOKTEMPERATURE;
\n /* ignore end of line */;
[ \t]+ /* ignore whitespace */;
%%
As you can see, we run atoi() on yytext, and put the result in yylval, where YACC can see it. We do much the same for the STATE match, where we compare it to 'on', and set yylval to 1 if it is equal. Please note that having a separate 'on' and 'off' match in Lex would produce faster code, but I wanted to show a more complicated rule and action for a change.
Now we need to learn YACC how to deal with this. What is called 'yylval' in Lex has a different name in YACC. Let's examine the rule setting the new temperature target:
target_set:
TOKTARGET TOKTEMPERATURE NUMBER
{
printf("\tTemperature set to %d\n",$3);
}
;
To access the value of the third part of the rule (ie, NUMBER), we need to use $3. Whenever yylex() returns, the contents of yylval are attached to the terminal, the value of which can be accessed with the $-construct.
To expound on this further, let's observe the new 'heat_switch' rule:
heat_switch:
TOKHEAT STATE
{
if($2)
printf("\tHeat turned on\n");
else
printf("\tHeat turned off\n");
}
;
If you now run example5, it properly outputs what you entered.
Let's repeat part of the configuration file we mentioned earlier:
zone "." {
type hint;
file "/etc/bind/db.root";
};
Remember that we already wrote a Lexer for this file. Now all we need to do is write the YACC grammar, and modify the Lexer so it returns values in a format YACC can understand.
In the lexer from Example 6 we see:
%{
#include <stdio.h>
#include "y.tab.h"
%}
%%
zone return ZONETOK;
file return FILETOK;
[a-zA-Z][a-zA-Z0-9]* yylval=strdup(yytext); return WORD;
[a-zA-Z0-9\/.-]+ yylval=strdup(yytext); return FILENAME;
\" return QUOTE;
\{ return OBRACE;
\} return EBRACE;
; return SEMICOLON;
\n /* ignore EOL */;
[ \t]+ /* ignore whitespace */;
%%
If you look carefully, you can see that yylval has changed! We no longer expect it to be an integer, but in fact assume that it is a char *. In the interest of keeping things simple, we invoke strdup and waste a lot of memory. Please note that this may not be a problem in many areas where you only need to parse a file once, and then exit.
We want to store character strings because we are now mostly dealing with names: file names and zone names. In a later chapter we will explain how to deal with multiple types of data.
In order to tell YACC about the new type of yylval, we add this line to the header of our YACC grammar:
#define YYSTYPE char *
The grammar itself is again more complicated. We chop it in parts to make it easier to digest.
commands:
|
commands command SEMICOLON
;
command:
zone_set
;
zone_set:
ZONETOK quotedname zonecontent
{
printf("Complete zone for '%s' found\n",$2);
}
;
This is the intro, including the aforementioned recursive 'root'. Please note that we specify that commands are terminated (and separated) by ;'s. We define one kind of command, the 'zone_set'. It consists of the ZONE token (the word 'zone'), followed by a quoted name and the 'zonecontent'. This zonecontent starts out simple enough:
zonecontent:
OBRACE zonestatements EBRACE
It needs to start with an OBRACE, a {. Then follow the zonestatements, followed by an EBRACE, }.
quotedname:
QUOTE FILENAME QUOTE
{
$$=$2;
}
This section defines what a 'quotedname' is: a FILENAME between QUOTEs. Then it says something special: the value of a quotedname token is the value of the FILENAME. This means that the quotedname has as its value the filename without quotes.
This is what the magic '$$=$2;' command does. It says: my value is the value of my second part. When the quotedname is now referenced in other rules, and you access its value with the $-construct, you see the value that we set here with $$=$2.
NOTE: this grammar chokes on filenames without either a '.' or a '/' in
them.
zonestatements:
|
zonestatements zonestatement SEMICOLON
;
zonestatement:
statements
|
FILETOK quotedname
{
printf("A zonefile name '%s' was encountered\n", $2);
}
;
This is a generic statement that catches all kinds of statements within the 'zone' block. We again see the recursiveness.
block:
OBRACE zonestatements EBRACE SEMICOLON
;
statements:
| statements statement
;
statement: WORD | block | quotedname
This defines a block, and 'statements' which may be found within.
When executed, the output is like this:
$ ./example6
zone "." {
type hint;
file "/etc/bind/db.root";
type hint;
};
A zonefile name '/etc/bind/db.root' was encountered
Complete zone for '.' found