+----------+  Token   +--------+ Parse   +-------------+
Source ===>| Lexical  | =======> | Parser | ======> | Interpreter |
           | Analyzer |  Stream  +--------+ Tree    +-------------+
           +----------+

The lexical analyzer layer (or lexer) translates the character stream’s source code into a token stream. The tokens act as placeholders for lexemes which are the literal strings in the source code, which allows the parser to deal with categories of strings rather than the strings themselves. The parser then generates a parse tree or some other internal representation of the program. It presents this to the interpreter layer, which traverses the structure and executes the code as it goes.

The division of the grammar processing into two layers assumes that a language is primarily context-free with some regular elements in place. The Lexers can process rules of the form:

< A > ::= "a" | "a" < B >

where < A > and < B > are non-terminal symbols and a is some terminal.
These types of rules form a regular grammar. If you are not familiar with language theory, but if you know about regular expressions, then you could also state this as “lexers can process the part of the language that can be written as regular expressions.”

< A > ::= alpha

Here alpha is any string of terminals and non-terminals. These rules form the context-free grammar. The reason why they are referred to as context-free is that each parser rule is processed by itself. It does not matter what rule was expanded before or after each rule. Most programming languages are predominantly regular, with the exception of name resolution. Name resolution is typically left to the interpreter layer because this is a context-sensitive operation. The idea, though, is that the parser can build the basic sequence of operations, and then the interpretive layer processes the semantics of the language.

Logo has several properties that somewhat blur the line between these layers. In some respects, it is much simpler than other programming languages, but some aspects of the language do not lend themselves to this traditional 3-layer architecture.

Building a Grammar for Logo

When I decided to write an interpreter for Logo, I set out to find a formal definition of the language. I wanted to see a Backus-Naur Form (BNF) grammar from which I could construct my lexer and parser. Once I had that, it would be a simple matter to construct a tree-walk interpreter. However, I could not find one! First, there is no formal Logo standard. Logo is more of a group of languages with a loosely similar set of behaviors and syntax. This is unsurprising in that Logo never received the sort of attention that would call for standardization. I was surprised that its original creators did not give that sort of definition. So I gathered as many Logo books from the 70s and 80s as I could, and I started to look for clues to the formal implementation of the language.

As I progressed in my study of the language’s grammar, it became apparent why you typically do not find a formal grammar for the language. There really is not that much syntax to describe! Worse, the syntax that is there is not entirely context-free. To illustrate this idea, I want to first look at the BNF definition of Lisp, and then from there, we will try to develop a BNF for Logo. Since Logo is Lisp without parens, perhaps we can adapt the Lisp grammar.

Lisp BNF Grammar

Lisp is sometimes thought of as a programming language without a syntax. Lisp’s creator did create a meta language (which he called M-Expressions). These M-expressions, a fairly algebraic syntax, would then be translated into S-Expressions. For instance, the M-Expression for function application f[x;y] would be translated into (f x y). To save compute cycles, many early Lisp programmers wrote the S-Expressions manually. Hence the prefixed set of lists that we all know and love today. This S-Expression syntax (in simplified form) looks like this:

< S-Expression >  ::= < Atomic-Symbol >
                      | "(" < S-Expression > "." < S-Expression > ")"
                      | < List >

< List >          ::= "(" < List-Contents> ")"

< List-Contents > ::= < S-Expression >
                      | < List-Contents > < S-Expression >

< Atomic-Symbol > ::= < Letter > < Atom-Part >

< Atom-Part >     ::= < Letter > < Atom-Part >
                      | < Number > < Atom-Part >
                      | ""

Note how we can find the beginning and end of everything. In Lisp, a function evaluation/application is simply a list. The first element of the list is the function, and the remainder of the list is the set of arguments. Say we want to work out 2+3*4. In Lisp, this would be (+ 2 (* 3 4)). There is no ambiguity in its order of operations, and parenthesis serve as our guideposts when parsing the language. Pretty simple, right? Let’s try to put together a similar grammar for Logo.

Formal Logo BNF Grammar

Let’s start with the concept of expressions. Logo does not have two types of expressions like the original Lisp, so I will not refer to it as having S-Expression rules. (Technically, modern Lisp usually does not include the M-Expression grammar, but it retains the label for heritage reasons.) Essentially, a Logo expression is either a series of zero or more words or a list:

< Expression > ::= < Word > < Expression >
                   | < List > < Expression >
                   | ""

Note that there is no equivalent of the dotted pair in Logo. Now all we need to do is define < Word > and < List >. A word in formal Logo takes on two forms, quoted and unquoted. They can consist of any character except for those that escape the word. Most of the escape characters belong to informal Logo, which we will deal with later. So, for now, let’s look at only the formal escapes. Essentially, there are only two things in formal Logo that cause an escape. These are whitespace characters and []. Nothing else will cause a word to terminate in formal Logo. There is also three possibility of escaped characters using \ followed by the literal character. So then we can build up a word grammar like this:

< Word >            ::= < Word > < Word-Character >
                        | < Word-Character >
                        | ""

< Word-Character >  ::= < Non-Escape-Char >
                        | "\" < Character >

< Non-Escape-Char > ::= /[^\[\][:space:]]/

< Character >       ::= /./

Note: I am abusing BNF slightly by permitting myself to enter sed-style regular expressions. Strict BNF would require I type out all characters, and that’s not efficient!

So now, that just leaves the lists. Since our expressions are essentially lists of words, we could render them as:

< List > ::= "[" < Expression > ”]"

< Expression >      ::= < Word > < Expression >
                        | < List  > < Expression >
                        | ""

< Word >            ::= < Word > < Word-Character >
                        | < Word-Character >
                        | ""

< Word-Character >  ::= < Non-Escape-Char >
                        | "\" < Character >

< Non-Escape-Char > ::= /[^\[\][:space:]]/

< Character >       ::= /./

< List >            ::= "[" < Expression > ”]"

Improving our grammar

What we have developed so far does describe the syntax of formal Logo. You may notice, however, that this accepts essentially any string. The only real constraints are that input cannot end directly after a \ and all [ characters must be balanced with a matching ]. The question now becomes “Is our grammar useful?”

To be useful in specifying a language, a grammar must provide some mechanism for identifying semantic units. What we have here only divides everything into words and lists. There are a few other semantics we should try to capture:

Essentially, quoted words and numbers form the atomic symbols of Logo. So, we can describe that reasonably easily:

< Atomic-Symbol > ::= < Quoted-Word >
                      | < Number >

< Quoted-Word >   ::= '"' < Word >

< Number >        ::= < Integer >
                      | < Float >

< Integer >       ::= /[:digit:]+/

< Float >         ::= < Integer > "." < Integer >

Now, that leaves function calls. In Logo, a function call consists of a word followed by the arguments to the function. Syntactically this is quite simple. Our previous grammar admits those strings, but it does not separate them from the other word types. We can correct this by modifying the expression rule to distinguish between atomic symbols and evaluated words:

< Expression >      ::= < Word > < Expression >
                        | < Atomic-Symbol > < Expression >
                        | < List > < Expression >
                        | ""

< Expression >      ::= < Word > < Expression >
                        | < Atomic-Symbol > < Expression >
                        | < List > < Expression >
                        | ""

< Word >            ::= < Word > < Word-Character >
                        | < Word-Character >
                        | ""

< Word-Character >  ::= < Non-Escape-Char >
                        | "\" < Character >

< Non-Escape-Char > ::= /[^\[\][:space:]]/

< Character >       ::= /./

< List >            ::= "[" < Expression > ”]"

< Atomic-Symbol > ::= < Quoted-Word >
                      | < Number >

< Quoted-Word >   ::= '"' < Word >

< Number >        ::= < Integer >
                      | < Float >

< Integer >       ::= /[:digit:]+/

< Float >         ::= < Integer > "." < Integer >

Context Sensitive Evaluation

I want to close by following up on something I implied earlier, and that is that this BNF grammar does not fully equip us to generate a parse tree for Logo. This is due to its lack of sigils to delineate the evaluation of functions. Consider the following:

SUM 3 PRODUCT 2 4

Now, consider how we may parse this. SUM is the function, but its arguments could be interpreted in various ways. Suppose we wanted to add parenthesis to indicate the order of evaluation. Here are the possible evaluations:

1.) SUM 3 (PRODUCT 2 4)
2.) SUM 3 (PRODUCT 2) 4
3.) SUM 3 (PRODUCT) 4

Number 1 is correct, but how do we know this? The answer is that we need to know how many arguments each function requires! Once we know that sum and product require two arguments, the first version is the only possible interpretation. That means the function definitions form the context necessary to determine where evaluations end. Hence we cannot simply create a context-free parser, build a tree, and then walk it! We will see more of this when we implement the evaluation layer.

(+ 3 (* 2 4))

Here we have true context-free parsing. So the simplicity of Logo seems to complicate its implementation, albeit only slightly. In my next post, I will detail the first steps in implementing Logo. See you then!

Analyzing the Logo Language

Aug 7th, 2023

Contents

A Bit of Theory: Structure of an Interpreter

Building a Grammar for Logo

Lisp BNF Grammar

Formal Logo BNF Grammar

Improving our grammar

Context Sensitive Evaluation