Compilers

1. Policy
- 1.1. Evaluation
- 1.2. Resources
2. Lectures

1. Policy

Attend lectures only if you want to. You are expected to keep up to date on the discussions that happen during lectures and labs.

1.1. Evaluation

Assessment	Percentage	Remarks
Lab quizzes	30	Small programming exercises
Semester-long project	50	Team of 4.
Mid-semester exam	10
End-semester exam	10

1.2. Resources

There is no textbook. Follow the lectures. Code a lot. Read necessary theory.

Bob Nystrom's Crafting Interpreters is a good book. I will disagree with this book on some of the topics.

2. Lectures

2.1. Calculator back-end

You have all used compilers and you know what they do. In this course, we will build one and explore how compilers are organized and the relevant algorithms.

The most common data structure in a compiler is a tree and the most common algorithm, the post-order traversal. To see why, let us build a simple calculator. It accepts expressions and evaluates them. What is the best way to represent an expression like 2 + 3*5 in a computer?


  +
 / \
2   *
   / \
  3   5

Yes. A tree. How would you evaluate this expression? Evaluate the left sub-tree, then the right sub-tree, then add the results. That is, a post-order traversal.

The front-end of the compiler has to take the string 2 + 3*5 and build this tree. This is done using two classes of algorihms called lexing and parsing, which correspond to DFAs and CFGs respectively. We will see this part later. First, we will build the back-end for the compiler. It takes trees as input and evaluates them.

First, let us define some classes in Python to represent expression trees.

from dataclasses import dataclass

class AST:
    pass

@dataclass
class BinOp(AST):
    op: str
    left: AST
    right: AST

@dataclass
class Number(AST):
    val: str

The tree for the previous expression can be created so:

expr_t1 = BinOp("+", Number("2"), BinOp("*", Number("3"), Number("5")))
print(expr_t1)

BinOp(op='+', left=Number(val='2'), right=BinOp(op='*', left=Number(val='3'), right=Number(val='5')))

Now, let us write an evaluator. This function should take the tree as input and produce a number.

def e(tree: AST) -> int:
    match tree:
        case Number(v): return int(v)
        case BinOp("+", l, r): return e(l) + e(r)
        case BinOp("*", l, r): return e(l) * e(r)

Let us test it by evaluating our sample expression.

e(expr_t1)

Yay! Now, extend this small program to make what you think a real calculator should be.

2.2. Calculator front-end

The front-end of the calculator should take a string as input and return the corresponding expression tree. We split this into a lexer and a parser.

The lexer converts a stream of characters into a stream of tokens. First, we define the tokens of our language.

class Token:
    pass

@dataclass
class NumberToken(Token):
    v: str

@dataclass
class OperatorToken(Token):
    o: str

Now, the lexer can produce the token stream.

from collections.abc import Iterator
def lex(s: str) -> Iterator[Token]:
    i = 0
    while True:
        while i < len(s) and s[i].isspace():
            i = i + 1

        if i >= len(s):
            return

        if s[i].isdigit():
            t = s[i]
            i = i + 1
            while i < len(s) and s[i].isdigit():
                t = t + s[i]
                i = i + 1
            yield NumberToken(t)
        else:
            match t := s[i]:
                case '+' | '*':
                    i = i + 1
                    yield OperatorToken(t)

Let us test our lexer with our sample expression.

for t in lex("2 + 3*5"):
    print(t)

NumberToken(v='2')
OperatorToken(o='+')
NumberToken(v='3')
OperatorToken(o='*')
NumberToken(v='5')

The parser takes the token stream as input and produces the AST.

from collections.abc import Iterator
def parse(s: str) -> AST:
    from more_itertools import peekable
    t = peekable(lex(s))

    def parse_add():
        ast = parse_mul()
        while True:
            match t.peek(None):
                case OperatorToken('+'):
                    next(t)
                    ast = BinOp('+', ast, parse_mul())
                case _:
                    return ast

    def parse_mul():
        ast = parse_atom()
        while True:
            match t.peek(None):
                case OperatorToken('*'):
                    next(t)
                    ast = BinOp("*", ast, parse_atom())
                case _:
                    return ast

    def parse_atom():
        match t.peek(None):
            case NumberToken(v):
                next(t)
                return Number(v)

    return parse_add()

Let us test the parser for some expressions.

print(parse("2"))
print(parse("2+3"))
print(parse("2+3*5"))

Number(val='2')
BinOp(op='+', left=Number(val='2'), right=Number(val='3'))
BinOp(op='+', left=Number(val='2'), right=BinOp(op='*', left=Number(val='3'), right=Number(val='5')))

To put the front-end and back-end together, we just have to compose parse and e.

print(e(parse("2 + 3*5")))

Now is the time to extend this into a more useful calculator. I suggest adding the following features.

All usual arithmetic operators, including the unary negation.
Parenthesis for grouping.
Use a proper number type for evaluation.
Error-handling in the front-end.

2.3. Conditional expressions

We will implement a conditional expression of the form if a then b else c end. To do this, we need the following additions:

Lexing: Add support for keywords like if.
Parsing: Add support for conditional expressions. This has a lower precedence than arithmetic.
Evaluation: Modify the post-order traversal to conditionally traverse sub-trees.

As before, we will start by defining the AST node.

@dataclass
class If(AST):
    c: AST
    t: AST
    e: AST

We will use BinOp for comparison operators.

We want to handle expressions like:

expr_t2 = If(BinOp("<", Number("2"), Number("3")), Number("0"), Number("1"))
expr_t3 = If(BinOp("<", Number("3"), Number("2")), Number("0"), Number("1"))

The evaluator is:

def e(tree: AST) -> int | bool:
    match tree:
        case Number(v): return int(v)
        case BinOp("+", l, r): return e(l) + e(r)
        case BinOp("*", l, r): return e(l) * e(r)
        case BinOp("<", l, r): return e(l) < e(r)
        case If(cond, then, else_):
            if e(cond):
                return e(then)
            else:
                return e(else_)

Let us test it with our sample expression.

print(e(expr_t2))
print(e(expr_t3))

0
1

Now, we will modify the front-end to allow the programmer to input these expressions. We will need a token type for keywords.

@dataclass
class KeywordToken(Token):
    w: str

In lex, if the first character is a letter, then we will try and form a keyword token. We should also handle comparison operators.

from collections.abc import Iterator
def lex(s: str) -> Iterator[Token]:
    i = 0
    while True:
        while i < len(s) and s[i].isspace():
            i = i + 1

        if i >= len(s):
            return

        if s[i].isalpha():
            t = s[i]
            i = i + 1
            while i < len(s) and s[i].isalpha():
                t = t + s[i]
                i = i + 1
            # XXX: Should check here whether we got a valid keyword.
            yield KeywordToken(t)
        elif s[i].isdigit():
            t = s[i]
            i = i + 1
            while i < len(s) and s[i].isdigit():
                t = t + s[i]
                i = i + 1
            yield NumberToken(t)
        else:
            match t := s[i]:
                case '+' | '*' | '<':
                    i = i + 1
                    yield OperatorToken(t)

Let us see what tokens are produced by our new lexer.

for t in lex("if 2 < 3 then 0 else 1 end"):
    print(t)

KeywordToken(w='if')
NumberToken(v='2')
OperatorToken(o='<')
NumberToken(v='3')
KeywordToken(w='then')
NumberToken(v='0')
KeywordToken(w='else')
NumberToken(v='1')
KeywordToken(w='end')

It looks fine. Now, we move on to parsing.

While parsing, we will peek at the first token. If it is the keyword if, then we will try to parse an if expression. If it begins with another token, we will try and parse an arithmetic expression.

class ParseError(Exception):
    pass

def parse(s: str) -> AST:
    from more_itertools import peekable
    t = peekable(lex(s))

    def expect(what: Token):
        if t.peek(None) == what:
            next(t)
            return
        raise ParseError

    def parse_if():
        match t.peek(None):
            case KeywordToken("if"):
                next(t)
                cond = parse_if()
                expect(KeywordToken("then"))
                then = parse_if()
                expect(KeywordToken("else"))
                else_ = parse_if()
                expect(KeywordToken("end"))
                return If(cond, then, else_)
            case _:
                return parse_cmp()

    def parse_cmp():
        l = parse_add()
        if t.peek(None) == OperatorToken('<'):
            next(t)
            r = parse_add()
            return BinOp('<', l, r)
        else:
            return l

    def parse_add():
        ast = parse_mul()
        while True:
            match t.peek(None):
                case OperatorToken('+'):
                    next(t)
                    ast = BinOp('+', ast, parse_mul())
                case _:
                    return ast

    def parse_mul():
        ast = parse_atom()
        while True:
            match t.peek(None):
                case OperatorToken('*'):
                    next(t)
                    ast = BinOp("*", ast, parse_atom())
                case _:
                    return ast

    def parse_atom():
        match t.peek(None):
            case NumberToken(v):
                next(t)
                return Number(v)

    return parse_if()

Let us test it on a sample.

print(parse("if 2 < 3 then 0+5 else 1*6 end"))

If(c=BinOp(op='<', left=Number(val='2'), right=Number(val='3')), t=BinOp(op='+', left=Number(val='0'), right=Number(val='5')), e=BinOp(op='*', left=Number(val='1'), right=Number(val='6')))

2.4. Variables

Consider the following emacs lisp example:

(defvar v 1)

(defun foo ()
  (princ v))

(defun bar ()
  (let ((v 10))
    (foo)))

(bar)

Compare it with the following Python code that is the same except in the scoping of variables.

v = 1

def foo():
    print(v)

def bar():
    v = 10
    foo()

bar()

The v in foo() refers to the global v in Python. The v in bar() is local to that function and cannot escape unless it is passed as an argument to foo(). This is called lexical scoping. The association of variable use to declaration is determined by program text. Emacs lisp uses dynamic scoping, where the association is determined by the state of execution.

(defvar v 1)

(defun foo ()
  (print v))

(defun bar (a)
  (if a (let ((v 10)) (foo)) (foo)))

(bar nil)
(bar t)


1

10

In the following example, observe that foo() picks up either the global v or the v local to bar() depending on the argument to bar().


1

10

Modern languages all use lexical scoping by default. But, an option to have dynamic variables is also nice.

2.5. Dynamic scoping

Implementing dynamic scoping is easy. During evaluation, we keep a stack that maps variable names to values. Each variable declaration pushes an entry into this stack and each variable use looks for the variable in this stack from top to bottom. This will pick-up the most recently declared variable of the same name.

If your language has no functions, then there is no difference between lexical and dynamic scoping.

The following class represents the programming construct let v be e in f end which binds the value e to v and then evaluates f. Here, v must be a valid identifier and e and f must be expressions.

@dataclass
class Let(AST):
    v: str
    e: AST
    f: AST

We also need a class for occurences of variables within expressions.

@dataclass
class Var(AST):
    v: str

The evaluation now requires a stack in addition to the AST. Usually, this stack is called the environment.

def lookup(env, v):
    for u, uv in reversed(env):
        if u == v:
            return uv
    raise ValueError("No value found.")

def e(t: AST, env = None) -> int:
    if env is None: env = []

    match t:
        case Var(v):
            return lookup(env, v)
        case Let(v, x, y):
            vv = e(x, env)
            env.append((v, vv))
            yv = e(y, env)
            env.pop()
            return yv
        case Number(s):
            return int(s)
        case BinOp("+", l, r):
            return  e(l, env) + e(r, env)
    assert False

Let us create two sample ASTs and test them.

expr_t4 = Let("a", Number("3"), BinOp("+", Var("a"), Var("a")))
expr_t5 = Let("a", Number("3"),
              Let("b", BinOp("+", Var("a"), Number("2")),
                  BinOp("+", Var("a"), Var("b"))))

print(e(expr_t4))
print(e(expr_t5))

6
8

Now, we will add the surface syntax which will look like let a be 3 in a + a end. I prefer keywords such as be instead of symbols such as = or := as words are easier to type for me. However, when choosing a keyword, we have to ensure that it is a word that will not be used often as variable names. The word be is not a common variable name. So this choice is fine.

The lexer should have variable tokens.

@dataclass
class VarToken(Token):
    v: str

After lexing a word, we have to lookup whether it is a keyword. If it is, then we return a keyword token. Otherwise, it is a variable.

from collections.abc import Iterator

def lex(s: str) -> Iterator[Token]:
    i = 0
    while True:
        while i < len(s) and s[i].isspace():
            i = i + 1

        if i >= len(s):
            return

        if s[i].isalpha():
            t = s[i]
            i = i + 1
            while i < len(s) and s[i].isalpha():
                t = t + s[i]
                i = i + 1
            if t in { "let", "be", "in", "end" }:
                yield KeywordToken(t)
            else:
                yield VarToken(t)
        elif s[i].isdigit():
            t = s[i]
            i = i + 1
            while i < len(s) and s[i].isdigit():
                t = t + s[i]
                i = i + 1
            yield NumberToken(t)
        else:
            match t := s[i]:
                case '+' | '*' | '<':
                    i = i + 1
                    yield OperatorToken(t)

Let us test the lexer.

for t in lex("let a be 3 in a + a end"):
    print(t)

KeywordToken(w='let')
VarToken(v='a')
KeywordToken(w='be')
NumberToken(v='3')
KeywordToken(w='in')
VarToken(v='a')
OperatorToken(o='+')
VarToken(v='a')
KeywordToken(w='end')

The parser tries to parse a let expression if it observes the token let at the beginning.

class ParseError(Exception):
    msg: str

def parse(s: str) -> AST:
    from more_itertools import peekable
    t = peekable(lex(s))

    def expect(what: Token):
        if t.peek(None) == what:
            next(t)
            return
        raise ParseError(f"Expected {what}")

    def parse_let():
        match t.peek(None):
            case KeywordToken("let"):
                next(t)
                vt = next(t)
                expect(KeywordToken("be"))
                e = parse_let()
                expect(KeywordToken("in"))
                f = parse_let()
                expect(KeywordToken("end"))
                return Let(vt.v, e, f)
            case _:
                return parse_add()

    def parse_add():
        ast = parse_atom()
        while True:
            match t.peek(None):
                case OperatorToken('+'):
                    next(t)
                    ast = BinOp('+', ast, parse_atom())
                case _:
                    return ast

    def parse_atom():
        match t.peek(None):
            case NumberToken(v):
                next(t)
                return Number(v)
            case VarToken(v):
                next(t)
                return Var(v)

    return parse_let()

print(e(parse("let a be 3 in a + a end")))
print(e(parse("let a be 3 in let b be a + 2 in a + b end end")))

6
8

2.6. Functions

Let us implement functions. We need two new AST nodes. One for defining and another for calling functions.

@dataclass
class Fun(AST):
    n: str # Name of function
    a: str # Parameter
    b: AST # Body
    e: AST # Function calls here

@dataclass
class Call(AST):
    n: str # Name of function
    a: AST # Argument

We want to write code such as:

expr_t6 = """
fun dbl(a) is a + a
in
  dbl(2) + dbl(3)
end
"""

to define a function and call it two times with two different arguments. It corresponds to the AST:

expr_t6ast = Fun (
    "dbl",
    "a",
    BinOp("+", Var("a"), Var("a")),
    BinOp("+", Call("dbl", Number("2")), Call("dbl", Number("3")))
)

First, the evaluation. When a function is defined, we parse and put its body into an environment. When it is called, we evaluate the body in an environment where the parameter is bound to the argument.

def lookup(env, v):
    for u, uv in reversed(env):
        if u == v:
            return uv
    raise ValueError("No value found.")

def e(t: AST, env = None) -> int:
    if env is None: env = []

    match t:
        case Let(v, x, y):
            vv = e(x, env)
            env.append((v, vv))
            yv = e(y, env)
            env.pop()
            return yv
        case Var(v):
            return lookup(env, v)
        case Fun(f, a, b, c):
            env.append((f, (a, b)))
            x = e(c, env)
            env.pop()
            return x
        case Call(f, x):
            a, b = lookup(env, f)
            env.append((a, e(x, env)))
            y = e(b, env)
            env.pop()
            return y
        case Number(s):
            return int(s)
        case BinOp("+", l, r):
            return  e(l, env) + e(r, env)

Let us test this on the hand-crafted AST.

print(e(expr_t6ast))

The modifications to the front-end are straight-forward. It is left as an exercise.

2.7. Functions with static scoping

We want to create an AST for the following code:

let x be 5 in
  fun f(y) is x in
    fun g(z) is let x be 6 in f(z) end in
      g(0)
    end
  end
end

and demonstrate that it evaluates to 6 with dynamic scoping. With static scoping, it would have evaluated to 5.

To implement static scoping. We add a resolve() pass that renames variables so that each variable has a unique name. For this, we redefine the Variable class.

@dataclass
class Var(AST):
    v: str
    i: int = None

where the i field holds the integer that will help us resolve variables to bindings.

We redefine some of the AST nodes to store Var instead of just a string.

@dataclass
class Let(AST):
    x: AST # Always a Var
    e: AST
    f: AST

@dataclass
class Fun(AST):
    f: str # Functions are second-class.
    x: AST # Parameter.
    b: AST
    e: AST

The AST for the above expression is:

expr_t7ast = Let (
    Var("x"),
    Number("5"),
    Fun (
        "f",
        Var("y"),
        Var("x"),
        Fun (
            "g",
            Var("z"),
            Let (
                Var("x"),
                Number("6"),
                Call("f", Var("x"))
            ),
            Call("g", Number("0")))))

Under dynamic scoping, we will get 6. With static scoping, we should get 5.

Now, we make a function to create fresh integers everytime it is called. This will help us fill in id of Var correctly.

def make_fresh():
    i = 0
    def fresh():
        nonlocal i
        i = i + 1
        return i
    return fresh

Finally, the resolve() is another post-order traversal over the AST.

def resolve(t: AST, env = None, fresh = None) -> AST:
    if env is None: env = []
    if fresh is None: fresh = make_fresh()

    match t:
        case Number(n):
            return Number(n)
        case Let(Var(x, _), e, f):
            er = resolve(e, env, fresh)
            env.append((x, i := fresh()))
            fr = resolve(f, env, fresh)
            env.pop()
            return Let(Var(x, i), er, fr)
        case Var(x, _):
            return Var(x, lookup(env, x))
        case Call(f, x):
            xr = resolve(x, env, fresh)
            return Call(f, xr)
        case Fun(f, Var(x, _), b, y):
            env.append((x, i := fresh()))
            br = resolve(b, env, fresh)
            env.pop()
            yr = resolve(y, env, fresh)
            return Fun(f, Var(x, i), br, yr)

Observe the AST before resolution.

print(expr_t7ast)

Let(x=Var(v='x', i=None), e=Number(val='5'), f=Fun(f='f', x=Var(v='y', i=None), b=Var(v='x', i=None), e=Fun(f='g', x=Var(v='z', i=None), b=Let(x=Var(v='x', i=None), e=Number(val='6'), f=Call(n='f', a=Var(v='x', i=None))), e=Call(n='g', a=Number(val='0')))))

Every i field is None.

Let us now look at the resolved form of our earlier AST.

print(resolve(expr_t7ast))

Let(x=Var(v='x', i=1), e=Number(val='5'), f=Fun(f='f', x=Var(v='y', i=2), b=Var(v='x', i=1), e=Fun(f='g', x=Var(v='z', i=3), b=Let(x=Var(v='x', i=4), e=Number(val='6'), f=Call(n='f', a=Var(v='x', i=4))), e=Call(n='g', a=Number(val='0')))))

Observe how the x in the body of f gets matched to the x introduced by the outermost let through the i field in Var. Now, we can adjust the e() to lookup values based on both the variable name and its id and obtain the correct variable as in static scoping.

2.8. Bytecode generation

Let us generate bytecode for a calculator. First, we have to define bytecode instructions, their format, and write a VM to execute the bytecode. We will use a simple two-byte format:

First byte is the opcode.
Second byte is the immediate operand if any.

We will write the bytecode VM in C. Let us define the opcodes.

enum op
{
    HALT = 0,
    NOP,
    PUSH,
    POP,
    ADD,
    SUB,
    MUL,
    NEG,
};

The interpreter evaluates and returns the final value on stack.

#include <stdio.h>
#include <stdint.h>
int execute(uint8_t *insns)
{
  size_t ip = 0;
  int operand[100], top = 0;
#define PUSH(x) (operand[top++] = x)
#define POP(x) (operand[--top])	

  int l, r;
  while (1) {
    switch (insns[ip]) {
    case HALT: goto end;
    case NOP: break;
    case PUSH:
      PUSH(insns[ip+1]); break;
    case ADD:
      r = POP(); l = POP(); PUSH(l+r); break;
    case SUB:
      r = POP(); l = POP(); PUSH(l-r); break;
    case MUL:
      r = POP(); l = POP(); PUSH(l*r); break;
    case NEG:
      l = POP(); PUSH(-l); break;
    }
    ip += 2; // No control-flow.
  }
 end:
  return POP();
#undef PUSH
#undef POP
}

Now, let us test the interpreter with some hand-crafted bytecode.

enum op
{
    HALT = 0,
    NOP,
    PUSH,
    POP,
    ADD,
    SUB,
    MUL,
    NEG,
};
#include <stdio.h>
#include <stdint.h>
int execute(uint8_t *insns)
{
  size_t ip = 0;
  int operand[100], top = 0;
#define PUSH(x) (operand[top++] = x)
#define POP(x) (operand[--top])	

  int l, r;
  while (1) {
    switch (insns[ip]) {
    case HALT: goto end;
    case NOP: break;
    case PUSH:
      PUSH(insns[ip+1]); break;
    case ADD:
      r = POP(); l = POP(); PUSH(l+r); break;
    case SUB:
      r = POP(); l = POP(); PUSH(l-r); break;
    case MUL:
      r = POP(); l = POP(); PUSH(l*r); break;
    case NEG:
      l = POP(); PUSH(-l); break;
    }
    ip += 2; // No control-flow.
  }
 end:
  return POP();
#undef PUSH
#undef POP
}
int main()
{
    uint8_t insns[] = {
      PUSH, 2,
      PUSH, 3,
      ADD , 0, // 0 is ignored.
      PUSH, 5,
      MUL , 0,
      HALT, 0,
    };
    printf("%d\n", execute(insns));
    return 0;
}

Now, we just have to take an AST for an expression into an array of bytes and write it out to a file.

HALT, NOP, PUSH, POP, ADD, SUB, MUL, NEG = range(8)

def do_codegen(t: AST, code):
    match t:
        case Number(v): # Careful. I don't handle large numbers here.
            code.append(PUSH)
            code.append(int(v))
        case BinOp("+", l, r):
            do_codegen(l, code)
            do_codegen(r, code)
            code.append(ADD)
            code.append(0)
        case BinOp("-", l, r):
            do_codegen(l, code)
            do_codegen(r, code)
            code.append(SUB)
            code.append(0)
        case BinOp("*", l, r):
            do_codegen(l, code)
            do_codegen(r, code)
            code.append(MUL)
            code.append(0)
    return code

def codegen(t):
    c = do_codegen(t, bytearray())
    c.append(HALT)
    return c

Let us create a sample AST and check the output bytes.

expr_cg = BinOp ("*", BinOp("+", Number("2"), Number("3")), Number("5"))

And, finally:

print(codegen(expr_cg))

bytearray(b'\x02\x02\x02\x03\x04\x00\x02\x05\x06\x00\x00')

Connecting all these pieces by appropriate file-handling is left as an exercise.

2.9. Bytecode execution for call and return

We will write bytecode equivalent of:

fun f(x) is x + 1 in f(42) end

which evaluates to 43 to illustrate CALL and RETURN.

The bytecode instructions are:

class OC:
    JF, PUSH, ADD, CALL, RET, GET, HALT = range(1, 8)

The above code is then compiled into the following bytecode:

sample = [
    (OC.JF, 5), # Jump to after the function body.
    (OC.GET, "x"),
    (OC.PUSH, 1),
    (OC.ADD, None),
    (OC.RET, None),
    (OC.PUSH, 42),
    (OC.GET, "f"),
    (OC.CALL, None),
    (OC.HALT, None)
]

To simplify discussion, I keep them as Python list. The real bytecode will be a sequence of bytes that encodes the above list.

We will discuss the bytecode evaluator first. The values in stack are either numbers or functions. We need objects corresponding to them.

from dataclasses import dataclass

@dataclass
class NumObj:
    n: int

    def __repr__(self):
        return f"{self.n}"

@dataclass
class FunObj:
    e: int # The entry point
    a: str # The parameter name

For executing functions, we need call frames.

@dataclass
class CallFrame:
    f: FunObj # The function object
    v: dict # arguments and locals
    r: int # The return address

After compiling the above bytecode, the compiler should create the following function object to encode f.

f = FunObj(1, "x")

The code generator should create both the bytecode and a dictionary that maps names to corresponding global objects.

@dataclass
class Code:
    b: list # The bytecode
    o: dict # The map

For our sample program:

code = Code(sample, { "f" : f })

Now, the function that actually executes the bytecode. We return the operand stack at the end to observe the final value.

def execute(code):
    ip = 0  # The current instruction.
    op = [] # Operand stack. Holds NumObjs or FunObjs.
    cs = [] # Call stack. Holds CallFrames.

    # We initialize the call stack with call frame of main()
    # In our example, the main() contains all code in the file
    # like in Python.
    c = CallFrame(FunObj(0, "_unused"), {}, None)
    c.v = code.o
    cs.append(c)

    while True:
        match code.b[ip]:
            case (OC.HALT, _):
                return op
            case (OC.JF, off):
                ip += off
            case (OC.PUSH, val):
                op.append(NumObj(val))
                ip += 1
            case (OC.ADD, _):
                left = op.pop()
                right = op.pop()
                op.append(NumObj(left.n+right.n))
                ip += 1
            case (OC.CALL, _):
                f = op.pop()
                x = op.pop()
                c = CallFrame(f, { f.a : x }, ip+1)
                cs.append(c)
                ip = f.e
            case (OC.RET, _):
                ip = cs[-1].r
                cs.pop()
            case (OC.GET, name):
                op.append(cs[-1].v[name])
                ip += 1

Finally, we execute the code:

print(execute(code))

[43]

Let us write the function:

fun f(x) is x+1 in fun g(y) is f(y+2) in g(42) end end

The corresponding bytecode is:

sample2 = [
    (OC.JF, 5),
    (OC.GET, "x"),
    (OC.PUSH, 1),
    (OC.ADD, None),
    (OC.RET, None),
    (OC.JF, 7),
    (OC.GET, "y"),
    (OC.PUSH, 2),
    (OC.ADD, None),
    (OC.GET, "f"),
    (OC.CALL, None),
    (OC.RET, None),
    (OC.PUSH, 42),
    (OC.GET, "g"),
    (OC.CALL, None),
    (OC.HALT, None)
]

The new code is:

code2 = Code(sample2, { "f" : FunObj(1, "x"), "g" : FunObj(6, "y") })

The call to f() in g() can only find the function object for f() in main()'s stack frame. So we need to search backwards when executing GET. This is the only change required in the execute().

def execute2(code):
    ip = 0  # The current instruction.
    op = [] # Operand stack. Holds NumObjs or FunObjs.
    cs = [] # Call stack. Holds CallFrames.

    # We initialize the call stack with call frame of main()
    # In our example, the main() contains all code in the file
    # like in Python.
    c = CallFrame(FunObj(0, "_unused"), {}, None)
    c.v = code.o
    cs.append(c)

    while True:
        match code.b[ip]:
            case (OC.HALT, _):
                return op
            case (OC.JF, off):
                ip += off
            case (OC.PUSH, val):
                op.append(NumObj(val))
                ip += 1
            case (OC.ADD, _):
                left = op.pop()
                right = op.pop()
                op.append(NumObj(left.n+right.n))
                ip += 1
            case (OC.CALL, _):
                f = op.pop()
                x = op.pop()
                c = CallFrame(f, { f.a : x }, ip+1)
                cs.append(c)
                ip = f.e
            case (OC.RET, _):
                ip = cs[-1].r
                cs.pop()
            case (OC.GET, name):
                for i in range(len(cs)-1, -1, -1):
                    if name in cs[i].v:
                        op.append(cs[i].v[name])
                        break
                ip += 1

print(execute2(code2))

[45]

The above kind of variable lookup is why in Python the following is:

def foo():
    global x
    y = 0
    for i in range(10000):
        y += x
    return y

slower than:

def foo():
    global x
    z = x
    y = 0
    for i in range(10000):
        y += z
    return y

2.10. Aside: Discriminating unions

The datatype double + double with a discriminating union is not the same as double. For example, temperature is a double+double and it cannot be collapsed into a double as it will lose the information whether it is in celsius or fahrenheit.

struct temperature {
        enum { CELSIUS, FAHRENHEIT } kind;
        union {
                double c;
                double f;
        };
};

In C, the programmer is expected to only access c when kind is CELSIUS. This is not enforced by the compiler.

The following Python types encode the same information. Unlike C, you can only access c when temperature is celsius.

from dataclasses import dataclass

class Temperature:
    pass

@dataclass
class Celsius(Temperature):
    c: float

@dataclass
class Fahrenheit(Temperature):
    f: float