<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>

<head>
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 2.0">
<title>Introduction to Geyacc</title>
</head>

<body bgcolor="#FFFFFF">

<table border="0" width="100%">
    <tr>
        <td><font size="6"><strong>Introduction to Geyacc</strong></font></td>
        <td align="right"><a href="index.html"><img
        src="image/previous.gif" alt="Previous" border="0"
        width="40" height="40"></a><a href="stages.html"><img
        src="image/next.gif" alt="Next" border="0" width="40"
        height="40"></a></td>
    </tr>
</table>

<hr size="1">

<p>This chapter introduces many of the basic concepts without
which the details of <em>geyacc</em> will not make sense. If you
do not already know how to use <em>yacc</em> or <font size="2"><em>GNU</em></font><em>
bison</em>, it is suggested to start by reading this chapter
carefully.</p>

<h2>Languages and Context-Free Grammars</h2>

<p>In order for <em>geyacc</em> to parse a language, it must be
described by a <em><strong>context-free grammar</strong></em>.
This means that you specify one or more <em><strong>syntactic
groupings</strong></em> and give rules for constructing them from
their parts. For example, in the Eiffel language, one kind of
grouping is called an <font color="#800080" size="2"
face="Courier New">expression</font>. One rule for making an
expression might be, &quot;An expression can be made up of a
minus sign and another expression&quot;. Another would be,
&quot;An expression can be an integer&quot;. As you can see,
rules are often recursive, but there must be at least one rule
which leads out of the recursion.</p>

<p>The most common formal system for presenting such rules for
humans to read is <em>Backus-Naur Form</em> or <font size="2"><em>BNF</em></font>,
which was developed in order to specify the language Algol 60.
Any grammar expressed in <font size="2">BNF</font> is a
context-free grammar. The input to <em>geyacc</em> is essentially
machine-readable <font size="2">BNF</font>.</p>

<p>Not all context-free languages can be handled by <em>geyacc</em>,
only those that are <font size="2">LALR(1)</font>. In brief, this
means that it must be possible to tell how to parse any portion
of an input string with just a single token of look-ahead.
Strictly speaking, that is a description of an <font size="2">LR(1)</font>
grammar, and <font size="2">LALR(1)</font> involves additional <a
href="algorithm.html#mysterious_reduce_reduce">restrictions</a>
that are hard to explain simply; but it is rare in actual
practice to find an <font size="2">LR(1)</font> grammar that
fails to be <font size="2">LALR(1)</font>.</p>

<p>In the formal grammatical rules for a language, each kind of
syntactic unit or grouping is named by a <a href="symbols.html"><em>symbol</em></a>.
Those which are built by grouping smaller constructs according to
grammatical rules are called <em><strong>nonterminal symbols</strong></em>;
those which can't be subdivided are called <em><strong>terminal
symbols</strong></em> or <em>token types</em>. A piece of input
corresponding to a single terminal symbol is called a <em><strong>token</strong></em>,
and a piece corresponding to a single nonterminal symbol a <em><strong>grouping</strong></em>.</p>

<p>Let's use the Eiffel language as an example of what symbols,
terminal and nonterminal, mean. The tokens of Eiffel are
identifiers, manifest constants (numeric, character and string),
and the various keywords, arithmetic operators and punctuation
marks. So the terminal symbols of a grammar for Eiffel include
`identifier', `number', `character', `string', plus one symbol
for each keyword, operator or punctuation mark: `if', `then',
`loop', `end', `class', `inherit', `plus-sign', `open-brace',
`close-brace', `comma' and many more. (These tokens can be
subdivided into characters, but that is a matter of lexicography,
not grammar.)</p>

<p>Here is a simple Eiffel routine subdivided into tokens:</p>

<blockquote>
    <pre><font color="#008080"><em>square </em>(<em>x</em>:<em> INTEGER</em>)</font>      -- identifier, open-paren,
                         -- identifier, colon, identifier,
                         -- close-paren
    <font color="#008080"><em><strong>do</strong></em></font>                   -- keyword `do'
        <font color="#008080"><em>foo </em>:=<em> x </em>*<em> x </em></font>    -- identifier, assign-sign,
                         -- identifier, star-sign, identifier
    <font color="#008080"><em><strong>end</strong></em></font>                  -- keyword `end'</pre>
</blockquote>

<p>The syntactic groupings of Eiffel include the expression, the
instruction, the assignment, the loop, etc. These are represented
in the grammar of Eiffel by nonterminal symbols <font
color="#800080"><tt>expression</tt></font>, <font color="#800080"><tt>instruction</tt></font>,<font
color="#800080" size="2" face="Courier New"> </font><font
color="#800080"><tt>assignment</tt></font> and <font
color="#800080"><tt>loop</tt></font>. The full grammar uses
dozens of additional language constructs, each with its own
nonterminal symbol, in order to express the meanings of these
four. The example above is a routine definition; it contains one
instruction. In the instruction, each <font color="#008080"><em><tt>x</tt></em></font>
is an expression and so is <font color="#008080"><em><tt>x </tt></em><tt>*</tt><em><tt>
x</tt></em></font>.</p>

<p>Each nonterminal symbol must have grammatical rules showing
how it is made out of simpler constructs. For example, one kind
of Eiffel instruction is the assignment instruction; this would
be described with a grammar rule which reads informally as
follows:</p>

<blockquote>
    <p>An <font color="#800080"><tt>instruction</tt></font> can
    be made up of an <font color="#800080"><tt>identifier</tt></font>,
    an assign-sign (<tt>:=</tt>) and an <font color="#800080"><tt>expression</tt></font>.</p>
</blockquote>

<p>There would be many other rules for <font color="#800080"><tt>instruction</tt></font>,
one for each kind of instruction in Eiffel.</p>

<p><a name="start_symbol">One</a> nonterminal symbol must be
distinguished as the special one which defines a complete
utterance in the language. It is called the <em><strong>start
symbol</strong></em>. In a compiler, this means a complete input
program. In the Eiffel language, the nonterminal symbol <font
color="#800080"><tt>class_declaration</tt></font> plays this
role. For example, <font color="#008080"><em><tt>1 </tt></em><tt>+
</tt><em><tt>2</tt></em></font> is a valid Eiffel expression <font
face="Symbol">-</font> a valid part of an Eiffel class <font
face="Symbol">-</font> but it is not valid as an entire Eiffel
class. In the context-free grammar of Eiffel, this follows from
the fact that <font color="#800080"><tt>expression</tt></font> is
not the start symbol.</p>

<p>The <em>geyacc</em> parser reads a sequence of tokens as its
input, and groups the tokens using the grammar rules. If the
input is valid, the end result is that the entire token sequence
reduces to a single grouping whose symbol is the grammar's start
symbol. If we use a grammar for Eiffel, the entire input must be
a <font color="#800080"><tt>class_declaration</tt></font>. If
not, the parser reports a syntax error.</p>

<h2>From Formal Rules to Geyacc Input</h2>

<p>A formal grammar is a mathematical construct. To define the
language for <em>geyacc</em>, you must write a file expressing
the grammar in <em>geyacc</em> syntax: a &quot;<a
href="description.html"><em>geyacc</em> grammar</a>&quot; file.</p>

<p>A nonterminal symbol in the formal grammar is represented in <em>geyacc</em>
input as an identifier, like an identifier in Eiffel. By
convention, it should be in lower case, such as <font
color="#800080"><tt>expression</tt></font>, <font color="#800080"><tt>instruction</tt></font>
or <font color="#800080"><tt>declaration</tt></font>.</p>

<p>The <em>geyacc</em> representation for a terminal symbol is
also called a <em>token type</em>. Token types as well can be
represented as Eiffel-like identifiers. By convention, these
identifiers should be upper case to distinguish them from
nonterminals: for example, <font color="#FF0000"><tt>NUMBER</tt></font>,
<font color="#FF0000"><tt>IDENTIFIER</tt></font> or <font
color="#FF0000"><tt>ASSIGN</tt></font>. The terminal symbol <font
color="#0000FF"><tt>error</tt></font> is reserved for <a
href="error.html">error recovery</a>.</p>

<p>A terminal symbol can also be represented as a character
literal, just like a C character constant, between single quotes.
You should do this whenever a token is just a single character
(parenthesis, plus-sign, etc.): use that same character in a
literal as the terminal symbol for that token. Note that Unicode
characters are supported. Just make sure that the input file is
using the UTF-8 encoding and starts with the
<a href="http://en.wikipedia.org/wiki/Byte_order_mark">BOM</a>
character.</p>

<p>The grammar rules also have an expression in <em>geyacc</em>
syntax. For example, here is the <em>geyacc</em> rule for an
Eiffel assignment instruction. The semicolon and the colon are <em>geyacc</em>
punctuation used in every rule.</p>

<blockquote>
    <pre><font color="#800080">instruction</font><font
color="#0000FF">:</font> <font color="#800080">identifier </font><font
color="#FF0000">ASSIGN</font><font color="#800080"> expression</font>
    <font color="#0000FF">;</font></pre>
</blockquote>

<h2><a name="semantic_values">Semantic Values</a></h2>

<p>A formal grammar selects tokens only by their classifications:
for example, if a rule mentions the terminal symbol `integer
constant', it means that any integer constant is grammatically
valid in that position. The precise value of the constant is
irrelevant to how to parse the input: if <font color="#808000"><tt>x+4</tt></font>
is grammatical then <font color="#808000"><tt>x+1</tt></font> or <font
color="#808000"><tt>x+3989</tt></font> is equally grammatical.</p>

<p>But the precise value is very important for what the input
means once it is parsed. A compiler is useless if it fails to
distinguish between <font color="#808000"><tt>4</tt></font>, <font
color="#808000"><tt>1</tt></font> and <font color="#808000"><tt>3989</tt></font>
as constants in the program! Therefore, each token in a <em>geyacc</em>
grammar has both a token type and a <em><strong>semantic value</strong></em>.</p>

<p>The token type is a terminal symbol defined in the grammar,
such as <font color="#FF0000"><tt>NUMBER</tt></font>, <font
color="#FF0000"><tt>IDENTIFIER</tt></font> or<font
color="#FF0000" size="2" face="Courier New"> </font><font
color="#FF0000"><tt>','</tt></font>. It tells everything you need
to know to decide where the token may validly appear and how to
group it with other tokens. The grammar rules know nothing about
tokens except their types.</p>

<p>The semantic value has all the rest of the information about
the meaning of the token, such as the value of an integer, or the
name of an identifier. (A token such as <font color="#FF0000"><tt>','</tt></font>
which is just punctuation doesn't need to have any semantic
value.)</p>

<p>For example, an input token might be classified as token type <font
color="#FF0000"><tt>NUMBER</tt></font> and have the semantic
value <font color="#808000"><tt>4</tt></font>. Another input
token might have the same token type <font color="#FF0000"><tt>NUMBER</tt></font>
but value <font color="#808000"><tt>3989</tt></font>. When a
grammar rule says that <font color="#FF0000"><tt>NUMBER</tt></font>
is allowed, either of these tokens is acceptable because each is
a <font color="#FF0000"><tt>NUMBER</tt></font>. When the parser
accepts the token, it keeps track of the token's semantic value.</p>

<p>Each grouping can also have a semantic value as well as its
nonterminal symbol. For example, in a calculator, an expression
typically has a semantic value that is a number. In a compiler
for a programming language, an expression typically has a
semantic value that is a tree structure describing the meaning of
the expression.</p>

<h2>Semantic Actions</h2>

<p>In order to be useful, a program must do more than parse
input; it must also produce some output based on the input. In a <em>geyacc</em>
grammar, a grammar rule can have an <a href="actions.html"><em>action</em></a>
made up of Eiffel instructions. Each time the parser recognizes a
match for that rule, the action is executed.</p>

<p>Most of the time, the purpose of an action is to compute the
semantic value of the whole construct from the semantic values of
its parts. For example, suppose we have a rule which says an
expression can be the sum of two expressions. When the parser
recognizes such a sum, each of the subexpressions has a semantic
value which describes how it was built up. The action for this
rule should create a similar sort of value for the newly
recognized larger expression. For example, here is a rule that
says an expression can be the sum of two subexpressions:</p>

<blockquote>
    <pre><font color="#800080">expr</font><font color="#0000FF">: </font><font
color="#800080">expr </font><font color="#FF0000">'+' </font><font
color="#800080">expr</font><font color="#0000FF">
        { $$ </font><font color="#008080">:=</font><font
color="#0000FF"> $1 </font><font color="#008080">+</font><font
color="#0000FF"> $3 }
    ;</font></pre>
</blockquote>

<p>The action says how to produce the semantic value of the sum
expression from the values of the two subexpressions.</p>

<h2>Generated Parser</h2>

<p>When you run <em>geyacc</em>, you give it a <em>geyacc</em>
grammar file as input. The output is an Eiffel class equipped
with features to parse the language described by the grammar.
This class is called a &quot;<em>geyacc</em> parser&quot;. Keep
in mind that the <em>geyacc</em> utility and the <em>geyacc</em>
parser are two distinct programs: the <em>geyacc</em> utility is
a program whose output is the <em>geyacc</em> parser that becomes
part of your program.</p>

<p>The job of the <em>geyacc</em> parser is to group tokens into
groupings according to the grammar rules <font face="Courier New">-</font>
for example, to build identifiers and operators into expressions.
As it does this, it runs the actions for the grammar rules it
uses.</p>

<p>The tokens come from a routine called the <a
href="parser.html#lexical_analyzer"><em>lexical analyzer</em></a>
that you must supply in some fashion (such as by writing it using
<a href="http://www.gobosoft.com/eiffel/gobo/gelex/index.html"><em>gelex</em></a>). The <em>geyacc</em>
parser calls the lexical analyzer each time it wants a new token.
It doesn't know what is &quot;inside&quot; the tokens (though
their semantic values may reflect this). Typically the lexical
analyzer makes the tokens by parsing characters of text, but <em>geyacc</em>
does not depend on this.</p>

<hr size="1">

<table border="0" width="100%">
    <tr>
        <td><address>
            <font size="2"><b>Copyright © 1998-2019</b></font><font
            size="1"><b>, </b></font><font size="2"><strong>Eric
            Bezault</strong></font><strong> </strong><font
            size="2"><br>
            <strong>mailto:</strong></font><a
            href="mailto:ericb@gobosoft.com"><font size="2">ericb@gobosoft.com</font></a><font
            size="2"><br>
            <strong>http:</strong></font><a
            href="http://www.gobosoft.com"><font size="2">//www.gobosoft.com</font></a><font
            size="2"><br>
            <strong>Last Updated:</strong> 20 September 2019</font><br>
            <!--webbot bot="PurpleText"
            preview="
$Date$
$Revision$"
            -->
        </address>
        </td>
        <td align="right" valign="top"><a
        href="http://www.gobosoft.com"><img
        src="image/home.gif" alt="Home" border="0" width="40"
        height="40"></a><a href="index.html"><img
        src="image/toc.gif" alt="Toc" border="0" width="40"
        height="40"></a><a href="index.html"><img
        src="image/previous.gif" alt="Previous" border="0"
        width="40" height="40"></a><a href="stages.html"><img
        src="image/next.gif" alt="Next" border="0" width="40"
        height="40"></a></td>
    </tr>
</table>
</body>
</html>
