[Parser] Build: PL By:JS

ReZero lol

initial

Origin: http://lisperator.net/pltut/ This is a tutorial on how to implement a programming language. If you ever wrote an interpreter or a compiler, then there is probably nothing new for you here. But, if you’re using regexps to “parse” anything that looks like a programming language, then please read at least the section on parsing. Let’s write less buggy code! The ToC on the right is in “simple-to-advanced” order. I’d recommend you not to skip forward, unless you know the subject well. You can always refer back if you don’t understand something. Also, questions and feedback are very much appreciated! The target audience is the average JavaScript / NodeJS programmer.

What are we going to learn

  • What is a parser, and how to write one.
  • How to write an interpreter.
  • Continuations, and why are they important.
  • Writing a compiler.
  • How to transform code to continuation-passing style.
  • A few basic optimization techniques.
  • Examples of what our λanguage brings new over plain JavaScript.

Description of the language

# this is a comment

println("Hello World!");

println(2 + 3 * 4);

# functions are introduced with `lambda` or `λ`
fib = lambda (n) if n < 2 then n else fib(n - 1) + fib(n - 2);

println(fib(15));

print-range = λ(a, b)             # `λ` is synonym to `lambda`
                if a <= b then {  # `then` here is optional as you can see below
                  print(a);
                  if a + 1 <= b {
                    print(", ");
                    print-range(a + 1, b);
                  } else println("");        # newline
                };
print-range(1, 5);

Output

Hello World!
14
610
1, 2, 3, 4, 5

The λanguage looks a bit like JavaScript, but it’s different. First, there are no statements, only expressions. An expression returns a value and can be used in place of any other expression. Semicolons are required to separate expressions in a “sequence”. The curly brackets, { and }, create such a sequence, and it’s itself an expression. Its value is what the last expression evaluates to. The following is a valid program:

a = {
  fib(10);  # has no side-effects, but it's computed anyway
  fib(15)   # the last semicolon can be missing
};
print(a); # prints 610

Functions are introduced with one of the keywords lambda or λ (they are synonyms). After the keyword there must be a (possibly empty) parenthesized list of variable names separated with commas, like in JavaScript — these are the argument names. The function body is a single expression, but it can be a sequence wrapped in {…}. There is no return statement (there are no statements) — the last expression evaluated in a function gives the value to return to its caller. There is no var. To introduce new variables, you can use what JavaScripters call “IIFE”. Use a lambda, declare variables as arguments. Variables have function scope, and functions are closures — like in JavaScript. Even if is itself an expression. In JavaScript you’d get that effect with the ternary operator:

a = foo() ? bar() : baz();           // JavaScript
a = if foo() then bar() else baz();  # λanguage

The then keyword is optional when the branch starts with an open bracket ({), as you can see in print-range above. Otherwise it is required. The else keyword is required if the alternative branch is present. Again, then and else take as body a single expression, but you can {group} multiple expressions by using brackets and semicolons. When the else branch is missing and the condition is false, the result of the if expression is false. Speaking of which, false is a keyword which denotes the only falsy value in our λanguage:

if foo() then print("OK");

will print “OK” if and only if the result of foo() is NOT false. There’s also a true keyword for completion, but really everything which is not false (in terms of JavaScript’s === operator) will be interpreted as true in conditionals (including the number 0 and the empty string “”). Also note above that there is no point to demand parentheses around an if’s condition. It’s no error if you add them, though, as an open paren starts an expression — but they’re just superfluous. A whole program is parsed as if it were embedded in curly brackets, therefore you need to place a semicolon after each expression. The last expression can be an exception.


Well, that’s our tiny λanguage. It’s not necessarily a good one. The syntax looks cute, but it has its traps. There are a lot of missing features, like objects or arrays; we don’t concentrate on them because they’re not essential for our journey. If you understand all this material, you’ll be able to implement those easily. In the next section we’ll write a parser for this λanguage.

origin: http://lisperator.net/pltut/

Summary

var ast = parse(TokenStream(InputStream(code))); Writing a parser is, depending on the language, a moderately complex task. In essence, it must transform a piece of code (which we inspect by looking at the characters) into an “abstract syntax tree” (AST). The AST is a structured in-memory representation of the program, and it’s “abstract” in the sense that it does not care exactly what characters is the source code made of, but it faithfully represents the semantics of it. For example, for the following program text:

sum = lambda(a, b) {
  a + b;
};
print(sum(1, 2));

our parser will generate the following AST, as a JavaScript object:

{
  type: "prog",
  prog: [
    // first line:
    {
      type: "assign",
      operator: "=",
      left: { type: "var", value: "sum" },
      right: {
        type: "lambda",
        vars: [ "a", "b" ],
        body: {
          // the body should be a "prog", but because
          // it contains a single expression, our parser
          // reduces it to the expression itself.
          type: "binary",
          operator: "+",
          left: { type: "var", value: "a" },
          right: { type: "var", value: "b" }
        }
      }
    },
    // second line:
    {
      type: "call",
      func: { type: "var", value: "print" },
      args: [{
        type: "call",
        func: { type: "var", value: "sum" },
        args: [ { type: "num", value: 1 },
                { type: "num", value: 2 } ]
      }]
    }
  ]
}

The main difficulty in writing a parser consists in a failure to properly organize the code. The parser should operate at a higher level than reading characters from a string. A few advices on how to keep complexity manageable:

  • Write many functions and keep them small. In every function, do one thing and do it well.

  • Do not try to use regexps for parsing. They don’t work. Regexps can be helpful in the lexer though, but I suggest to limit them to very simple things.

  • Don’t attempt to guess. When unsure how to parse something, throw an error and make sure the message contains the error location (line/column).

To keep it simple I’ve split my code in three parts, which are further divided into many small functions:

  • The character input stream
  • The token input stream (lexer)
  • The parser

Input stream

We’re going to create a “stream object” which provides operations to read characters from a string. A stream object has 4 methods:

  • peek() — returns the next value but without removing it from the stream.

  • next() — returns the next value and also discards it from the stream.

  • eof() — returns true if and only if there are no more values in the stream.

  • croak(msg) — does throw new Error(msg).

The reason why I’m including the last one is that the stream can easily keep track of the current location (i.e. line/column), which is important to display in the case of an error message. Feel free to add more methods here, depending on your needs. The character input stream deals with characters, so the values that next() / peek() return are chars (well, since JS doesn’t have a char type, they’re strings containing one character). Here is the full code of this object, which I will call “InputStream”. It’s small enough and you should have no problem to understand it:

function InputStream(input) {
    var pos = 0, line = 1, col = 0;
    return {
        next  : next,
        peek  : peek,
        eof   : eof,
        croak : croak,
    };
    function next() {
        var ch = input.charAt(pos++);
        if (ch == "\n") line++, col = 0; else col++;
        return ch;
    }
    function peek() {
        return input.charAt(pos);
    }
    function eof() {
        return peek() == "";
    }
    function croak(msg) {
        throw new Error(msg + " (" + line + ":" + col + ")");
    }
}

Note that it’s not a standard object (the kind you create with new). You just do var stream = InputStream(string) to get a stream object. Next we’re going to write another abstraction on top of this object: the tokenizer.

Token Stream

The tokenizer (also called “lexer”) operates on a character input stream and returns a stream object with the same interface, but the values returned by peek() / next() will be tokens. A token is an object with two properties: type and value. Here are some examples with supported tokens:

{ type: "punc", value: "(" }           // punctuation: parens, comma, semicolon etc.
{ type: "num", value: 5 }              // numbers
{ type: "str", value: "Hello World!" } // strings
{ type: "kw", value: "lambda" }        // keywords
{ type: "var", value: "a" }            // identifiers
{ type: "op", value: "!=" }            // operators

Whitespace and comments are skipped over, no tokens are returned. In order to write the tokenizer we need to look more closely into the syntax of our language. The idea is to notice that depending on the current character (as returned by input.peek()) we can decide what kind of token to read:

  • First off, skip over whitespace.
  • If input.eof() then return null.
  • If it’s a sharp sign (#), skip comment (retry after - the end of line).
  • If it’s a quote then read a string.
  • If it’s a digit, then we proceed to read a number.
  • If it’s a “letter”, then read an identifier or a keyword token.
  • If it’s one of the punctuation characters, return a punctuation token.
  • If it’s one of the operator characters, return an operator token.
  • If none of the above, error out with input.croak().

So here’s the “read_next” function — the “core” of the tokenizer — which implements the above:

function read_next() {
    read_while(is_whitespace);
    if (input.eof()) return null;
    var ch = input.peek();
    if (ch == "#") {
        skip_comment();
        return read_next();
    }
    if (ch == '"') return read_string();
    if (is_digit(ch)) return read_number();
    if (is_id_start(ch)) return read_ident();
    if (is_punc(ch)) return {
        type  : "punc",
        value : input.next()
    };
    if (is_op_char(ch)) return {
        type  : "op",
        value : read_while(is_op_char)
    };
    input.croak("Can't handle character: " + ch);
}

This is a “dispatcher” function and it’s what next() will call in order to fetch the next token. Note it uses many utilities that are focused on particular token types, like read_string(), read_number() etc. There’s no point to complicate the dispatcher with code from those functions, even if we never call them elsewhere. Another thing to notice is that we don’t consume all the input stream in one step. Each time the parser will call for next token, we read one token. In case of a parse error we don’t even reach the end of the stream.


read_ident() will read characters as long as they are allowed as part of an identifier (is_id). Identifiers must start with a letter, or λ or _, and can contain further such characters, or digits, or one of the following: ?!-<>=. Therefore, foo-bar will not be read as three tokens but as a single identifier (a “var” token). The reason for this rule is that I’d like to be able to define functions named is-pair? or string>= (sorry, it’s the Lisper in me).


Also, the read_ident() function will check the identifier against the list of known keywords, and if it’s there it will return a “kw” token, instead of a “var” one. I think the code pretty much speaks for itself now, so here is the complete tokenizer for our language. Couple of small other notes below.

function TokenStream(input) {
    var current = null;
    var keywords = " if then else lambda λ true false ";
    return {
        next  : next,
        peek  : peek,
        eof   : eof,
        croak : input.croak
    };
    function is_keyword(x) {
        return keywords.indexOf(" " + x + " ") >= 0;
    }
    function is_digit(ch) {
        return /[0-9]/i.test(ch);
    }
    function is_id_start(ch) {
        return /[a-zλ_]/i.test(ch);
    }
    function is_id(ch) {
        return is_id_start(ch) || "?!-<>=0123456789".indexOf(ch) >= 0;
    }
    function is_op_char(ch) {
        return "+-*/%=&|<>!".indexOf(ch) >= 0;
    }
    function is_punc(ch) {
        return ",;(){}[]".indexOf(ch) >= 0;
    }
    function is_whitespace(ch) {
        return " \t\n".indexOf(ch) >= 0;
    }
    function read_while(predicate) {
        var str = "";
        while (!input.eof() && predicate(input.peek()))
            str += input.next();
        return str;
    }
    function read_number() {
        var has_dot = false;
        var number = read_while(function(ch){
            if (ch == ".") {
                if (has_dot) return false;
                has_dot = true;
                return true;
            }
            return is_digit(ch);
        });
        return { type: "num", value: parseFloat(number) };
    }
    function read_ident() {
        var id = read_while(is_id);
        return {
            type  : is_keyword(id) ? "kw" : "var",
            value : id
        };
    }
    function read_escaped(end) {
        var escaped = false, str = "";
        input.next();
        while (!input.eof()) {
            var ch = input.next();
            if (escaped) {
                str += ch;
                escaped = false;
            } else if (ch == "\\") {
                escaped = true;
            } else if (ch == end) {
                break;
            } else {
                str += ch;
            }
        }
        return str;
    }
    function read_string() {
        return { type: "str", value: read_escaped('"') };
    }
    function skip_comment() {
        read_while(function(ch){ return ch != "\n" });
        input.next();
    }
    function read_next() {
        read_while(is_whitespace);
        if (input.eof()) return null;
        var ch = input.peek();
        if (ch == "#") {
            skip_comment();
            return read_next();
        }
        if (ch == '"') return read_string();
        if (is_digit(ch)) return read_number();
        if (is_id_start(ch)) return read_ident();
        if (is_punc(ch)) return {
            type  : "punc",
            value : input.next()
        };
        if (is_op_char(ch)) return {
            type  : "op",
            value : read_while(is_op_char)
        };
        input.croak("Can't handle character: " + ch);
    }
    function peek() {
        return current || (current = read_next());
    }
    function next() {
        var tok = current;
        current = null;
        return tok || read_next();
    }
    function eof() {
        return peek() == null;
    }
}
  • The next() function doesn’t always call read_next(), because it might have been peeked before (in which case read_next() was already called and the stream advanced). Therefore we need a current variable which keeps track of the current token.

  • We only support decimal numbers with the usual notation (no 1E5 stuff, no hex, no octal). But if we ever need more, the changes go only in read_number() and are pretty easy to do.

  • Unlike JavaScript, the only characters that cannot appear unquoted in a string are the quote character itself and the backslash. You need to backslash them. Otherwise strings can contain hard newlines, tabs, and whatnot. We don’t interpret the usual escapes like \n, \t etc. though again, the changes would be pretty trivial (in “read_string”).

We now have sufficiently powerful tools to easily write the parser, but first I’d recommend you to look at the description of the AST.

AST

num { type: "num", value: NUMBER }
str { type: "str", value: STRING }
bool { type: "bool", value: true or false }
var { type: "var", value: NAME }
lambda { type: "lambda", vars: [ NAME... ], body: AST }
call { type: "call", func: AST, args: [ AST... ] }
if { type: "if", cond: AST, then: AST, else: AST }
assign { type: "assign", operator: "=", left: AST, right: AST }
binary { type: "binary", operator: OPERATOR, left: AST, right: AST }
prog { type: "prog", prog: [ AST... ] }
let { type: "let", vars: [ VARS... ], body: AST }

Example:

let (a = 10, b = a * 10) {
  a + b;
}

To:

{
  "type": "let",
  "vars": [
    {
      "name": "a",
      "def": { "type": "num", "value": 10 }
    },
    {
      "name": "b",
      "def": {
        "type": "binary",
        "operator": "*",
        "left": { "type": "var", "value": "a" },
        "right": { "type": "num", "value": 10 }
      }
    }
  ],
  "body": {
    "type": "binary",
    "operator": "+",
    "left": { "type": "var", "value": "a" },
    "right": { "type": "var", "value": "b" }
  }
}

The parser

The parser creates AST nodes that are described in the AST section. Thanks to the work we did in the tokenizer, the parser operates on a stream of tokens instead of dealing with individual characters. It still defines many helpers to keep complexity down. I’ll discuss here the main functions that comprise the parser. Let’s start with a high level one, the lambda parser:

function parse_lambda() {
    return {
        type: "lambda",
        vars: delimited("(", ")", ",", parse_varname),
        body: parse_expression()
    };
}

This function will be invoked when the lambda keyword has already been seen and “eaten” from the input, so all it cares for is to parse the argument names; but they’re in parentheses and delimited by commas. Rather than placing that code in parse_lambda, I preferred to write a delimited function that takes these arguments: the start token, the end token, the separator, and a function which parses whatever must be between those start/end tokens. In this case, it’s parse_varname, which takes care to throw an error if it encounters anything which doesn’t look like a variable. The body of the function is an expression, so we get it with parse_expression. delimited is a bit lower-level:

function delimited(start, stop, separator, parser) {
    var a = [], first = true;
    skip_punc(start);
    while (!input.eof()) {
        if (is_punc(stop)) break;
        if (first) first = false; else skip_punc(separator);
        if (is_punc(stop)) break; // the last separator can be missing
        a.push(parser());
    }
    skip_punc(stop);
    return a;
}

As you can see, it uses more utilities: is_punc and skip_punc. The former will return true if the current token is the given punctuation sign (without “eating” it), while skip_punc will ensure that the current token is that punctuation (throws an error otherwise) and will discard it from the input. The function that parses the whole program is probably the simplest:

function parse_toplevel() {
    var prog = [];
    while (!input.eof()) {
        prog.push(parse_expression());
        if (!input.eof()) skip_punc(";");
    }
    return { type: "prog", prog: prog };
}

Since we have no statements, we simply call parse_expression() and read expressions until we get to the end of the input. Using skip_punc(";") we demand semicolons between these expressions. Another simple example: parse_if():

function parse_if() {
    skip_kw("if");
    var cond = parse_expression();
    if (!is_punc("{")) skip_kw("then");
    var then = parse_expression();
    var ret = { type: "if", cond: cond, then: then };
    if (is_kw("else")) {
        input.next();
        ret.else = parse_expression();
    }
    return ret;
}

It skips over the if keyword with skip_kw (and this throws an error if the current token is not the given keyword), reads the condition using parse_expression(). Next, if the consequent branch doesn’t start with a { we require the keyword then to be present (I feel like the syntax is too scarce without it). The branches are just expressions, so again we use parse_expression() for them. The else branch is optional so we need to check if the keyword is present before parsing it. Having many small utilities helps a lot in keeping the code simple. We almost write the parser like we had a high level language dedicated for parsing. All these functions are “mutually recursive”, e.g.: there’s a parse_atom() function which is the main dispatcher — based on the current token it calls other functions. One of them is parse_if() (called when the current token is if) and that in turn calls parse_expression(). But parse_expression() calls parse_atom(). The reason why there’s no infinite loop is that at each step, one function or another will advance at least one token. This kind of parser is called a “recursive descent parser” and it’s probably the easiest kind to write manually. **Lower level: parse_atom() and parse_expression() ** parse_atom() does the main dispatching job, depending on the current token:

function parse_atom() {
    return maybe_call(function(){
        if (is_punc("(")) {
            input.next();
            var exp = parse_expression();
            skip_punc(")");
            return exp;
        }
        if (is_punc("{")) return parse_prog();
        if (is_kw("if")) return parse_if();
        if (is_kw("true") || is_kw("false")) return parse_bool();
        if (is_kw("lambda") || is_kw("λ")) {
            input.next();
            return parse_lambda();
        }
        var tok = input.next();
        if (tok.type == "var" || tok.type == "num" || tok.type == "str")
            return tok;
        unexpected();
    });
}

If it sees an open paren, then it must be a parenthesized expression — thus, skip over paren, call parse_expression() and expect a closing paren. If it sees some keyword, it calls the appropriate parser function. If it sees a constant or an identifier, it’s just returned as is. And if nothing works, unexpected() will throw an error. When an atomic expression is expected and it sees {, it calls parse_prog to parse a sequence of expressions. That’s defined below. It will do some minor optimization at this point — if the prog is empty, then it just returns FALSE. If it has a single expression, it is returned instead of a “prog” node. Otherwise it returns a “prog” node containing the expressions.

// we're going to use the FALSE node in various places,
// so I'm making it a global.
var FALSE = { type: "bool", value: false };

function parse_prog() {
    var prog = delimited("{", "}", ";", parse_expression);
    if (prog.length == 0) return FALSE;
    if (prog.length == 1) return prog[0];
    return { type: "prog", prog: prog };
}

Here’s the parse_expression() function. Contrary to parse_atom(), this one will extend an expression as much as possible to the right using maybe_binary(), which is explained below.

function parse_expression() {
    return maybe_call(function(){
        return maybe_binary(parse_atom(), 0);
    });
}

The maybe_ functions*

These functions check what follows after an expression in order to decide whether to wrap that expression in another node, or just return it as is. maybe_call() is very simple. It receives a function that is expected to parse the current expression. If after that expression it sees a ( punctuation token, then it must be a “call” node, which is what parse_call() makes (included below). Notice again how delimited() comes in handy for reading the argument list.

function maybe_call(expr) {
    expr = expr();
    return is_punc("(") ? parse_call(expr) : expr;
}

function parse_call(func) {
    return {
        type: "call",
        func: func,
        args: delimited("(", ")", ",", parse_expression)
    };
}

Operator precedence

maybe_binary(left, my_prec) is used to compose binary expressions like 1 + 2 * 3. The trick to parse them properly is to correctly define the operator precedence, so we’ll start with that:

var PRECEDENCE = {
    "=": 1,
    "||": 2,
    "&&": 3,
    "<": 7, ">": 7, "<=": 7, ">=": 7, "==": 7, "!=": 7,
    "+": 10, "-": 10,
    "*": 20, "/": 20, "%": 20,
};

This says that * binds tighter than +, so an expression like 1 + 2 * 3 must be read as (1 + (2 * 3)) instead of ((1 + 2) * 3), which would be the normal left-to-right order in which the parser operates. The trick is to read an atomic expression (only the 1) and pass it to maybe_binary() (the left argument), along with the current precedence (the my_prec). maybe_binary will look at what follows. If it doesn’t see an operator, or if it has a smaller priority, then left is returned as is. If it’s an operator that has a higher precedence than ours, then it wraps left in a new “binary” node, and for the right side it repeats the trick at the new precedence level (*):

function maybe_binary(left, my_prec) {
    var tok = is_op();
    if (tok) {
        var his_prec = PRECEDENCE[tok.value];
        if (his_prec > my_prec) {
            input.next();
            var right = maybe_binary(parse_atom(), his_prec) // (*);
            var binary = {
                type     : tok.value == "=" ? "assign" : "binary",
                operator : tok.value,
                left     : left,
                right    : right
            };
            return maybe_binary(binary, my_prec);
        }
    }
    return left;
}

Note that before returning the binary expression we must also call maybe_binary at the old precedence level (my_prec), in order to wrap the expression in another one, should an operator with a higher precedence follow. If all this is confusing, read the code again and again (perhaps try to execute it mentally on some input expressions) until you get it. Finally, since my_prec is initially zero, any operator will trigger the building of a “binary” node (or “assign” when the operator is =). There are a few more functions in the parser, so I’m including the whole parse function below. Click “Show code” to display it (about 150 lines).

var FALSE = { type: "bool", value: false };
function parse(input) {
    var PRECEDENCE = {
        "=": 1,
        "||": 2,
        "&&": 3,
        "<": 7, ">": 7, "<=": 7, ">=": 7, "==": 7, "!=": 7,
        "+": 10, "-": 10,
        "*": 20, "/": 20, "%": 20,
    };
    return parse_toplevel();
    function is_punc(ch) {
        var tok = input.peek();
        return tok && tok.type == "punc" && (!ch || tok.value == ch) && tok;
    }
    function is_kw(kw) {
        var tok = input.peek();
        return tok && tok.type == "kw" && (!kw || tok.value == kw) && tok;
    }
    function is_op(op) {
        var tok = input.peek();
        return tok && tok.type == "op" && (!op || tok.value == op) && tok;
    }
    function skip_punc(ch) {
        if (is_punc(ch)) input.next();
        else input.croak("Expecting punctuation: \"" + ch + "\"");
    }
    function skip_kw(kw) {
        if (is_kw(kw)) input.next();
        else input.croak("Expecting keyword: \"" + kw + "\"");
    }
    function skip_op(op) {
        if (is_op(op)) input.next();
        else input.croak("Expecting operator: \"" + op + "\"");
    }
    function unexpected() {
        input.croak("Unexpected token: " + JSON.stringify(input.peek()));
    }
    function maybe_binary(left, my_prec) {
        var tok = is_op();
        if (tok) {
            var his_prec = PRECEDENCE[tok.value];
            if (his_prec > my_prec) {
                input.next();
                return maybe_binary({
                    type     : tok.value == "=" ? "assign" : "binary",
                    operator : tok.value,
                    left     : left,
                    right    : maybe_binary(parse_atom(), his_prec)
                }, my_prec);
            }
        }
        return left;
    }
    function delimited(start, stop, separator, parser) {
        var a = [], first = true;
        skip_punc(start);
        while (!input.eof()) {
            if (is_punc(stop)) break;
            if (first) first = false; else skip_punc(separator);
            if (is_punc(stop)) break;
            a.push(parser());
        }
        skip_punc(stop);
        return a;
    }
    function parse_call(func) {
        return {
            type: "call",
            func: func,
            args: delimited("(", ")", ",", parse_expression),
        };
    }
    function parse_varname() {
        var name = input.next();
        if (name.type != "var") input.croak("Expecting variable name");
        return name.value;
    }
    function parse_if() {
        skip_kw("if");
        var cond = parse_expression();
        if (!is_punc("{")) skip_kw("then");
        var then = parse_expression();
        var ret = {
            type: "if",
            cond: cond,
            then: then,
        };
        if (is_kw("else")) {
            input.next();
            ret.else = parse_expression();
        }
        return ret;
    }
    function parse_lambda() {
        return {
            type: "lambda",
            vars: delimited("(", ")", ",", parse_varname),
            body: parse_expression()
        };
    }
    function parse_bool() {
        return {
            type  : "bool",
            value : input.next().value == "true"
        };
    }
    function maybe_call(expr) {
        expr = expr();
        return is_punc("(") ? parse_call(expr) : expr;
    }
    function parse_atom() {
        return maybe_call(function(){
            if (is_punc("(")) {
                input.next();
                var exp = parse_expression();
                skip_punc(")");
                return exp;
            }
            if (is_punc("{")) return parse_prog();
            if (is_kw("if")) return parse_if();
            if (is_kw("true") || is_kw("false")) return parse_bool();
            if (is_kw("lambda") || is_kw("λ")) {
                input.next();
                return parse_lambda();
            }
            var tok = input.next();
            if (tok.type == "var" || tok.type == "num" || tok.type == "str")
                return tok;
            unexpected();
        });
    }
    function parse_toplevel() {
        var prog = [];
        while (!input.eof()) {
            prog.push(parse_expression());
            if (!input.eof()) skip_punc(";");
        }
        return { type: "prog", prog: prog };
    }
    function parse_prog() {
        var prog = delimited("{", "}", ";", parse_expression);
        if (prog.length == 0) return FALSE;
        if (prog.length == 1) return prog[0];
        return { type: "prog", prog: prog };
    }
    function parse_expression() {
        return maybe_call(function(){
            return maybe_binary(parse_atom(), 0);
        });
    }
}

Credit

The moment I understood how to write a non-trivial parser occurred while studying Marijn Haverbeke’s parse-js library (Common Lisp). The parser above, although for a much simpler language, is modeled after his code.


Summary

To recap: so far we wrote 3 functions: InputStream, TokenStream and parse. To get an AST from a piece of code now we can do the following:

1
var ast = parse(TokenStream(InputStream(code)));

Writing an interpreter is easier than the parser. We just have to walk the AST, executing expressions in their normal order.

The environment

The key to correct execution is to properly maintain the environment — a structure holding variable bindings. It will be passed as an argument to our evaluate function. Each time we enter a “lambda” node we must extend the environment with new variables (function’s arguments) and initialize them with values passed at run time. If an argument shadows a variable from the outer scope (I’ll use words scope and environment interchangeably here) we must be careful to restore the previous value when we leave the function.

The simplest way to implement this is using JavaScript’s prototype inheritance. When we enter a function we’ll create a new environment, set its prototype to the outer (parent) environment and evaluate the function body in the new one. This way when we exit we need not do anything — the outer env will already contain any shadowed bindings.

Here’s the definition of the Environment object:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
function Environment(parent) {
this.vars = Object.create(parent ? parent.vars : null);
this.parent = parent;
}
Environment.prototype = {
extend: function() {
return new Environment(this);
},
lookup: function(name) {
var scope = this;
while (scope) {
if (Object.prototype.hasOwnProperty.call(scope.vars, name))
return scope;
scope = scope.parent;
}
},
get: function(name) {
if (name in this.vars)
return this.vars[name];
throw new Error("Undefined variable " + name);
},
set: function(name, value) {
var scope = this.lookup(name);
// let's not allow defining globals from a nested environment
if (!scope && this.parent)
throw new Error("Undefined variable " + name);
return (scope || this).vars[name] = value;
},
def: function(name, value) {
return this.vars[name] = value;
}
};

An Environment object has a parent, which points to the parent scope. The parent will be null for the global scope. And it has a vars property which holds the variable bindings. This is initialized as Object.create(null) for the toplevel (global) scope, or Object.create(parent.vars) for subscopes, in order to “see” the current bindings via prototypal inheritance.

There are the following methods:

  • extend() — to create a subscope.

  • lookup(name) — to find the scope where the variable with the given name is defined.

  • get(name) — to get the current value of a variable. Throws an error if the variable is not defined.

  • set(name, value) — to set the value of a variable. This needs to lookup the actual scope where the variable is defined. If it’s not found and we’re not in the global scope, throws an error.

  • def(name, value) — this creates (or shadows, or overwrites) a variable in the current scope.

The evaluate function

Now that we have the Environment we can jump to the main problem. This function will be a big switch statement, dispatching by node type, containing logic for evaluating each kind of node. I’ll comment on each case:

1
2
function evaluate(exp, env) {
switch (exp.type) {

For constant nodes, we just return their value:

1
2
3
4
case "num":
case "str":
case "bool":
return exp.value;

Variables are fetched from the environment. Remember that “var” tokens contain the name in the value property:

1
2
case "var":
return env.get(exp.value);

For assignment, we need to check if the left side is a “var” token (if not, throw an error; we don’t support assignment to anything else for now). Then we use env.set to set the value. Note that the value needs to be computed first by calling evaluate recursively.

1
2
3
4
case "assign":
if (exp.left.type != "var")
throw new Error("Cannot assign to " + JSON.stringify(exp.left));
return env.set(exp.left.value, evaluate(exp.right, env));

A “binary” node needs to apply an operator to two operands. We’ll write the apply_op function later, it’s quite trivial. Again, we need to call the evaluator recursively to compute the left and right operands:

1
2
3
4
case "binary":
return apply_op(exp.operator,
evaluate(exp.left, env),
evaluate(exp.right, env));

A “lambda” node will actually result in a JavaScript closure, so it will be callable from JavaScript just like an ordinary function. I wrote a make_lambda function, which I’ll define later:

1
2
case "lambda":
return make_lambda(env, exp);

Evaluating an “if” node is simple: first evaluate the condition. If it’s not false then evaluate the “then” branch and return its value. Otherwise, evaluate the “else” branch, if present, or return false.

1
2
3
4
case "if":
var cond = evaluate(exp.cond, env);
if (cond !== false) return evaluate(exp.then, env);
return exp.else ? evaluate(exp.else, env) : false;

A “prog” is a sequence of expressions. We just evaluate them in order and return the value of the last one. For an empty sequence, the return value is initialized to false.

1
2
3
4
case "prog":
var val = false;
exp.prog.forEach(function(exp){ val = evaluate(exp, env) });
return val;

For a “call” node we need to call a function. First we evaluate the func, which should return a normal JS function, then we evaluate the args and apply that function.

1
2
3
4
5
case "call":
var func = evaluate(exp.func, env);
return func.apply(null, exp.args.map(function(arg){
return evaluate(arg, env);
}));

We should never get here, but just in case we add new node types in the parser and we forget to update the evaluator, let’s throw a clear error.

1
2
3
4
      default:
throw new Error("I don't know how to evaluate " + exp.type);
}
}

This was the core of the evaluator and as you can see it’s really simple. We still need to write two more functions, let’s start with apply_op as it’s the easiest one:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
function apply_op(op, a, b) {
function num(x) {
if (typeof x != "number")
throw new Error("Expected number but got " + x);
return x;
}
function div(x) {
if (num(x) == 0)
throw new Error("Divide by zero");
return x;
}
switch (op) {
case "+" : return num(a) + num(b);
case "-" : return num(a) - num(b);
case "*" : return num(a) * num(b);
case "/" : return num(a) / div(b);
case "%" : return num(a) % div(b);
case "&&" : return a !== false && b;
case "||" : return a !== false ? a : b;
case "<" : return num(a) < num(b);
case ">" : return num(a) > num(b);
case "<=" : return num(a) <= num(b);
case ">=" : return num(a) >= num(b);
case "==" : return a === b;
case "!=" : return a !== b;
}
throw new Error("Can't apply operator " + op);
}

It receives the operator and the arguments. Just a boring switch to apply it. Unlike JavaScript, which applies any operator to any arguments and moves on whether that makes any sense or not, we require that the operands for numeric operators be numbers, and that a divizor is not zero, using the small helpers num and div. For strings we’ll define something else.

make_lambda

The make_lambda is a bit subtle:

1
2
3
4
5
6
7
8
9
10
function make_lambda(env, exp) {
function lambda() {
var names = exp.vars;
var scope = env.extend();
for (var i = 0; i < names.length; ++i)
scope.def(names[i], i < arguments.length ? arguments[i] : false);
return evaluate(exp.body, scope);
}
return lambda;
}

As you can see, it returns a plain JavaScript function that encloses over the environment and the expression to evaluate. It’s important to understand that nothing happens when this closure is created — but when it’s called, it will extend the environment that it saved at creation time with the new bindings of arguments/values (if less values are passed than the function’s argument list, the missing ones will get the value false). And then it just evaluates the body in the new scope.

primitive functions

You can observe that our language does not provide any means to interact with the outside world. In some code examples I’ve used some print and println functions, but they are not defined anywhere. These have to be defined as primitive functions (that is, we’ll write them in JavaScript and will insert them into the global environment).

To put it all together now, here’s a test program:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// some test code here
var code = "sum = lambda(x, y) x + y; print(sum(2, 3));";

// remember, parse takes a TokenStream which takes an InputStream
var ast = parse(TokenStream(InputStream(code)));

// create the global environment
var globalEnv = new Environment();

// define the "print" primitive function
globalEnv.def("print", function(txt){
console.log(txt);
});

// run the evaluator
evaluate(ast, globalEnv); // will print 5

Test:

1
echo 'sum = lambda(x, y) x + y; println(sum(2, 3));' | node lambda-eval1.js
  • Post title:[Parser] Build: PL By:JS
  • Post author:ReZero
  • Create time:2017-12-02 18:42:56
  • Post link:https://rezeros.github.io/2017/12/02/parser-build-pl-byjs/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.
 Comments