Contents
A Bit of Theory: Structure of an Interpreter
Traditionally, the architecture of an interpreter looks like this:
+----------+ Token +--------+ Parse +-------------+
Source ===>| Lexical | =======> | Parser | ======> | Interpreter |
| Analyzer | Stream +--------+ Tree +-------------+
+----------+
The lexical analyzer layer (or lexer) translates the character stream’s source
code into a token stream. The tokens act as placeholders for lexemes which are
the literal strings in the source code, which allows the parser to deal with
categories of strings rather than the strings themselves. The parser then
generates a parse tree or some other internal representation of the program. It presents this to the interpreter layer, which traverses the structure
and executes the code as it goes.
The division of the grammar processing into two layers assumes that
a language is primarily context-free with some regular elements in place. The
Lexers can process rules of the form:
< A > ::= "a" | "a" < B >
where < A > and < B > are non-terminal symbols and a is some terminal.
These types of rules form a regular grammar. If you are not familiar with
language theory, but if you know about regular expressions, then you could also
state this as “lexers can process the part of the language that can be written
as regular expressions.”
The parser can process rules of the form:
Here alpha is any string of terminals and non-terminals. These rules form
the context-free grammar. The reason why they are referred to as context-free is that each parser rule is processed by itself. It does not matter what
rule was expanded before or after each rule. Most programming languages
are predominantly regular, with the exception of name resolution. Name resolution
is typically left to the interpreter layer because this is a context-sensitive
operation. The idea, though, is that the parser can build the basic
sequence of operations, and then the interpretive layer processes the semantics
of the language.
Logo has several properties that somewhat blur the line between these
layers. In some respects, it is much simpler than other programming languages,
but some aspects of the language do not lend themselves to this traditional
3-layer architecture.
Building a Grammar for Logo
When I decided to write an interpreter for Logo, I set out to find
a formal definition of the language. I wanted to see a Backus-Naur Form (BNF)
grammar from which I could construct my lexer and parser. Once I had that, it
would be a simple matter to construct a tree-walk interpreter. However, I could
not find one! First, there is no formal Logo standard. Logo is more of a group
of languages with a loosely similar set of behaviors and syntax. This is unsurprising in that Logo never received the sort of attention that would
call for standardization. I was surprised that its original
creators did not give that sort of definition. So I gathered as many Logo books
from the 70s and 80s as I could, and I started to look for clues to the formal
implementation of the language.
As I progressed in my study of the language’s grammar, it became apparent why
you typically do not find a formal grammar for the language. There really is
not that much syntax to describe! Worse, the syntax that is there is not
entirely context-free. To illustrate this idea, I want to first look at the
BNF definition of Lisp, and then from there, we will try to develop a BNF for
Logo. Since Logo is Lisp without parens, perhaps we can adapt the Lisp grammar.
Lisp BNF Grammar
Lisp is sometimes thought of as a programming language without a syntax. Lisp’s
creator did create a meta language (which he called M-Expressions). These
M-expressions, a fairly algebraic syntax, would then be
translated into S-Expressions. For instance, the M-Expression for function
application f[x;y] would be translated into (f x y) . To save
compute cycles, many early Lisp programmers wrote the S-Expressions manually.
Hence the prefixed set of lists that we all know and love today. This
S-Expression syntax (in simplified form) looks like this:
< S-Expression > ::= < Atomic-Symbol >
| "(" < S-Expression > "." < S-Expression > ")"
| < List >
< List > ::= "(" < List-Contents> ")"
< List-Contents > ::= < S-Expression >
| < List-Contents > < S-Expression >
< Atomic-Symbol > ::= < Letter > < Atom-Part >
< Atom-Part > ::= < Letter > < Atom-Part >
| < Number > < Atom-Part >
| ""
Note how we can find the beginning and end of everything. In Lisp, a function
evaluation/application is simply a list. The first element of the list is the
function, and the remainder of the list is the set of arguments. Say we want
to work out 2+3*4 . In Lisp, this would be (+ 2 (* 3 4)) . There is no
ambiguity in its order of operations, and parenthesis serve as our guideposts
when parsing the language. Pretty simple, right? Let’s try to put together a
similar grammar for Logo.
Let’s start with the concept of expressions. Logo does not have
two types of expressions like the original Lisp, so I will not refer
to it as having S-Expression rules. (Technically, modern Lisp usually
does not include the M-Expression grammar, but it retains the label
for heritage reasons.) Essentially, a Logo expression is either
a series of zero or more words or a list:
< Expression > ::= < Word > < Expression >
| < List > < Expression >
| ""
Note that there is no equivalent of the dotted pair in Logo. Now all
we need to do is define < Word > and < List > . A word in formal
Logo takes on two forms, quoted and unquoted. They can consist of any
character except for those that escape the word. Most of the
escape characters belong to informal Logo, which we will deal
with later. So, for now, let’s look at only the formal escapes.
Essentially, there are only two things in formal Logo that cause an
escape. These are whitespace characters and [] . Nothing else will cause
a word to terminate in formal Logo. There is also three possibility of
escaped characters using \ followed by the literal character. So
then we can build up a word grammar like this:
< Word > ::= < Word > < Word-Character >
| < Word-Character >
| ""
< Word-Character > ::= < Non-Escape-Char >
| "\" < Character >
< Non-Escape-Char > ::= /[^\[\][:space:]]/
< Character > ::= /./
Note: I am abusing BNF slightly by permitting myself to enter
sed-style regular expressions. Strict BNF would require I type out all
characters, and that’s not efficient!
So now, that just leaves the lists. Since our expressions are
essentially lists of words, we could render them as:
< List > ::= "[" < Expression > ”]"
If we put this all together, we have the following grammar:
< Expression > ::= < Word > < Expression >
| < List > < Expression >
| ""
< Word > ::= < Word > < Word-Character >
| < Word-Character >
| ""
< Word-Character > ::= < Non-Escape-Char >
| "\" < Character >
< Non-Escape-Char > ::= /[^\[\][:space:]]/
< Character > ::= /./
< List > ::= "[" < Expression > ”]"
Improving our grammar
What we have developed so far does describe the syntax of formal Logo. You may
notice, however, that this accepts essentially any string. The only real
constraints are that input cannot end directly after a \ and all [
characters must be balanced with a matching ] . The question now becomes
“Is our grammar useful?”
To be useful in specifying a language, a grammar must provide
some mechanism for identifying semantic units. What we have here only
divides everything into words and lists. There are a few other semantics
we should try to capture:
- Quoted vs. Unquoted words
- Numbers (Integers and Floats)
- Function Calls
Essentially, quoted words and numbers form the atomic symbols of Logo.
So, we can describe that reasonably easily:
< Atomic-Symbol > ::= < Quoted-Word >
| < Number >
< Quoted-Word > ::= '"' < Word >
< Number > ::= < Integer >
| < Float >
< Integer > ::= /[:digit:]+/
< Float > ::= < Integer > "." < Integer >
Now, that leaves function calls. In Logo, a function call
consists of a word followed by the arguments to the function.
Syntactically this is quite simple. Our previous grammar admits those
strings, but it does not separate them from the other word types. We
can correct this by modifying the expression rule to distinguish
between atomic symbols and evaluated words:
< Expression > ::= < Word > < Expression >
| < Atomic-Symbol > < Expression >
| < List > < Expression >
| ""
Now we can put this all together to make our improved Logo grammar:
< Expression > ::= < Word > < Expression >
| < Atomic-Symbol > < Expression >
| < List > < Expression >
| ""
< Word > ::= < Word > < Word-Character >
| < Word-Character >
| ""
< Word-Character > ::= < Non-Escape-Char >
| "\" < Character >
< Non-Escape-Char > ::= /[^\[\][:space:]]/
< Character > ::= /./
< List > ::= "[" < Expression > ”]"
< Atomic-Symbol > ::= < Quoted-Word >
| < Number >
< Quoted-Word > ::= '"' < Word >
< Number > ::= < Integer >
| < Float >
< Integer > ::= /[:digit:]+/
< Float > ::= < Integer > "." < Integer >
Context Sensitive Evaluation
I want to close by following up on something I implied earlier,
and that is that this BNF grammar does not fully equip us to generate
a parse tree for Logo. This is due to its lack of sigils to delineate
the evaluation of functions. Consider the following:
Now, consider how we may parse this. SUM is the function,
but its arguments could be interpreted in various ways. Suppose
we wanted to add parenthesis to indicate the order of evaluation. Here
are the possible evaluations:
1.) SUM 3 (PRODUCT 2 4)
2.) SUM 3 (PRODUCT 2) 4
3.) SUM 3 (PRODUCT) 4
Number 1 is correct, but how do we know this? The answer is that we
need to know how many arguments each function requires! Once we
know that sum and product require two arguments, the first version is
the only possible interpretation. That means the function
definitions form the context necessary to determine where
evaluations end. Hence we cannot simply create a context-free parser,
build a tree, and then walk it! We will see more of this when we
implement the evaluation layer.
Consider the same in Lisp:
Here we have true context-free parsing. So the simplicity of Logo
seems to complicate its implementation, albeit only slightly. In my
next post, I will detail the first steps in implementing Logo. See
you then!
Computers
|
Home
|
Humor
|
Links
|