Implementing a Functional Language I: The Core Language
This is part 1 of a series in implementing a functional language. The introduction is here. This is a literate Haskell file - you can download the source here. To load it into GHCi and play around, you can use the following command:
stack --resolver lts-12.2 \
ghci --package prettyprinter \
--package text \
2019-03-03-the-core-language.lhs
{-# LANGUAGE OverloadedStrings #-}
module Core.Language where
import Data.Text (Text)
The Core Language
Let’s look at the language we’re going to compile. It should look familiar if you’ve used Haskell or another ML-like language before.
This is a simple Core program that evaluates to 6:
main = addTwo 4 ;
addTwo n = n + 2
This is a more complex program showcasing all of Core’s features:
main = let f = g Pack{1,0}
in f 4 ;
-- Lines starting like this are comments
g a b = case a of
<1> -> b + 1 ;
<2> -> b + 2
We’ll briefly describe each in turn.
Expressions and application
All valid constructs are expressions - they evaluate to a result. Functions are applied to
arguments via juxtaposition: f 1 2
. A Core program consists of a series of
definitions. Definitions are delineated by a semicolon ;
. Semicolons are also used
where necessary in other constructs. You’ll see examples of this as we go on.
Global definitions
Functions can be defined at the global level, like g
in the example above. The syntax is the
same as any ML-like language.
functionName arg1 ... argN = functionBody
where arg1 ... argN
are bound in functionBody
.
Values can be defined similarly, resembling nullary functions.
five = 5 ;
seven = five + 2
A valid Core program must have a nullary top level definition called main
. This is the
entrypoint of the program.
Local definitions
let
statements allow defining values in a local scope.
f = let a = 4 ;
double x = x + x
in double a
letrec
statements are similar, but allow recursive definitions. We make the distinction
between the two because non-recursive let
s are often easier to implement and we can take
an approach that leads to more performant code than the recursive case1.
Data Types
In Haskell, a data type might be defined as follows:
data Colour = Red | Green | Blue
data Coord = MkCoord Int Int
data Tree a
= Leaf a
| Branch (Tree a)
(Tree a)
The first case defines a type with three distinct nullary constructors. The second
case defines a type with a single constructor taking two arguments. The third case
combines these two, and is also parameterised over an abstract type a
.
In Core we assume that typechecking has already happened, and so we don’t need to
distinguish between constructors of different types. The constructor names are also not
important2. Core doesn’t support polymorphism, so type parameters don’t need to be
considered. As such, we can simplify data type declaration. In Core, all type constructors
take the form Pack{t,n}
where t
is the tag of the constructor and n
is the arity
of the constructor. The tag is an integer starting at 1 and distinguishes between
constructors of the same type. The arity of a constructor is the number of arguments it
takes. For example,
data Colour = Red | Green | Blue
can be represented as
Red = Pack{1,0} ;
Green = Pack{2,0} ;
Blue = Pack{3,0}
Similarly,
data Tree
= Leaf Int
| Branch Tree Tree
translates to
Leaf = Pack{1,1} ;
Branch = Pack{2,2}
Case expressions
We construct data with Pack
, and deconstruct it with case
. This is the only way to
inspect data in Core, and subsumes all forms of pattern matching in the higher level
language (e.g. multiple function definitions)3. Case expressions look like this:
isLeaf tree = case tree of
<1> v -> True ;
<2> t1 t2 -> False
A case
expression consists of a scrutinee (tree
in the example) followed by one or more
alternatives separated by a semicolon. Alternatives take the form
<t> x1 ... xn -> e
where t
is the tag of the constructor and x1 ... xn
are the arguments to the
constructor. The number of arguments must match the arity of the constructor. If the tag
of the constructor on the scrutinee matches the tag on the alternative, the whole case
expression evaluates to e
. Every constructor must have a matching alternative - the
behaviour when a scrutinee’s tag has no corresponding alternative is not defined (our
program will probably just crash).
Built-in primitives
Core has built in support for integers, and a small set of arithmetic and boolean operations defined as binary functions. The representation of Booleans themselves will vary between implementations - often we will map them to the integers 0 and 1.
Operation | Symbol |
---|---|
Addition | + |
Subtraction | - |
Multiplication | * |
Division | / |
Equality | == |
Greater than | > |
Greater than or equal to | >= |
Less than | < |
Less than or equal to | <= |
And | & |
Or | | |
Integer negation is performed via the primitive function negate
.
Lambda abstractions
Notably absent from Core is lambda abstraction, or anonymous functions. Functions are always supercombinators. This is an explicit choice to simplify the implementation, and we do not lose any expressive power as a result. Lambda abstractions can be mechanically removed from a program through a process known as lambda lifting. We will cover this in more detail in later section.
Modelling the language
Each implementation of the compiler will take as input a representation of a Core program, so we need to define what that is. Everything else will be built around this type.
data Expr a
= EVar Name -- variables
| ENum Int -- numbers
| EConstr -- constructor
Int -- tag
Int -- arity
| EAp (Expr a) -- applications
Expr a)
(| ELet -- let(rec) expressions
Recursive -- recursive (letrec) or nonrecursive (let)
Expr a)] -- definitions
[(a, Expr a) -- body
(| ECase -- case expressions
Expr a) -- expression to scrutinise
(Alter a] -- alternatives
[deriving (Show)
data Recursive = Recursive | NonRecursive
deriving (Show)
type Name = Text
Expr
describes an expression in the Core language. It is parameterised over the type of
its binders - a binder is a name given to a variable on the left hand side of a
let(rec)
expression or function definition. This will allow us to model variable binding
more sophisticatedly in later parts without changing the definition of Expr
. For now we
can make do with simple binders, so we define a type synonym for convenience.
type CoreExpr = Expr Name
Function application is modelled by the EAp
constructor. Applications of more than one
argument are transformed into nested EAp
nodes.
f 1 2
-- becomes
EAp (EAp (EVar "f") (ENum 1)) (ENum 2)
Case expressions consist of a scrutinee and a list of alternatives, modelled by Alter
,
which is a tuple of the constructor tag, the constructor arguments and the result
expression.
type Alter a = (Int, [a], Expr a)
-- 0 1 2
-- case s of
-- [0] [1] [2]
-- <1> x y -> e
A supercombinator is a function with no free variables - in Core all global definitions are supercombinators.
type ScDefn a = (Name, [a], Expr a)
type CoreScDefn = ScDefn Name
A Core program is a collection of supercombinator definitions, one of which is called “main”.
type Program a = [ScDefn a]
type CoreProgram = Program Name
We can now define a complete Core program, as an example.
-- main = double 21 ;
-- double x = x + x
exampleProgram :: CoreProgram
exampleProgram= [
"main", [], EAp (EVar "double") (ENum 21))
("double", ["x"], EAp (EAp (EVar "+") (EVar "x")) (EVar "x"))
, ( ]
Parsing and Pretty Printing
We want to be able to parse and pretty print Core, but this is not a parsing or pretty
printing tutorial. Instead of going through this in detail, we will rely on the
Parse
and Print
modules. Each of these are defined in literate
Haskell files and you can read them if you want to. However I’d recommend seeking out a
proper introduction to these topics if you’re not familiar with them. If I come across or
recall any good ones I may link to them here for reference. The book also has an
extensive section on each of these topics, though the style is somewhat outdated.
The higher-level source language might have a single unified
let
construct, with an analysis phase to determine which cases must be converted toletrec
in Core.↩︎Except for providing useful error messages. Since you’d have to build this support into the compiler for your higher level language, we don’t consider it here.↩︎
Compiling more expressive pattern matching into case expressions is a well studied problem and there are established algorithms for doing so. For more information on this, see Chapter 5 of The Implementation of Functional Programming Languages.↩︎