Implementing a Functional Language I: The Core Language

This is part 1 of a series in implementing a functional language. The introduction is here. This is a literate Haskell file - you can download the source here. To load it into GHCi and play around, you can use the following command:

stack --resolver lts-12.2 \
      ghci --package prettyprinter \
           --package text \
      2019-03-03-the-core-language.lhs

{-# LANGUAGE OverloadedStrings #-}
module Core.Language where
import Data.Text (Text)

The Core Language

Let’s look at the language we’re going to compile. It should look familiar if you’ve used Haskell or another ML-like language before.

This is a simple Core program that evaluates to 6:

main = addTwo 4 ;
addTwo n = n + 2

This is a more complex program showcasing all of Core’s features:

main = let f = g Pack{1,0}
        in f 4 ;
-- Lines starting like this are comments
g a b = case a of
          <1> -> b + 1 ;
          <2> -> b + 2

We’ll briefly describe each in turn.

Expressions and application

All valid constructs are expressions - they evaluate to a result. Functions are applied to arguments via juxtaposition: f 1 2. A Core program consists of a series of definitions. Definitions are delineated by a semicolon ;. Semicolons are also used where necessary in other constructs. You’ll see examples of this as we go on.

Global definitions

Functions can be defined at the global level, like g in the example above. The syntax is the same as any ML-like language.

functionName arg1 ... argN = functionBody

where arg1 ... argN are bound in functionBody.

Values can be defined similarly, resembling nullary functions.

five = 5 ;
seven = five + 2

A valid Core program must have a nullary top level definition called main. This is the entrypoint of the program.

Local definitions

let statements allow defining values in a local scope.

f = let a = 4 ;
        double x = x + x
     in double a

letrec statements are similar, but allow recursive definitions. We make the distinction between the two because non-recursive lets are often easier to implement and we can take an approach that leads to more performant code than the recursive case¹.

Data Types

In Haskell, a data type might be defined as follows:

data Colour = Red | Green | Blue
data Coord = MkCoord Int Int
data Tree a
  = Leaf a
  | Branch (Tree a)
           (Tree a)

The first case defines a type with three distinct nullary constructors. The second case defines a type with a single constructor taking two arguments. The third case combines these two, and is also parameterised over an abstract type a.

In Core we assume that typechecking has already happened, and so we don’t need to distinguish between constructors of different types. The constructor names are also not important². Core doesn’t support polymorphism, so type parameters don’t need to be considered. As such, we can simplify data type declaration. In Core, all type constructors take the form Pack{t,n} where t is the tag of the constructor and n is the arity of the constructor. The tag is an integer starting at 1 and distinguishes between constructors of the same type. The arity of a constructor is the number of arguments it takes. For example,

data Colour = Red | Green | Blue

can be represented as

Red = Pack{1,0} ;
Green = Pack{2,0} ;
Blue = Pack{3,0}

Similarly,

data Tree
  = Leaf Int
  | Branch Tree Tree

translates to

Leaf = Pack{1,1} ;
Branch = Pack{2,2}

Case expressions

We construct data with Pack, and deconstruct it with case. This is the only way to inspect data in Core, and subsumes all forms of pattern matching in the higher level language (e.g. multiple function definitions)³. Case expressions look like this:

isLeaf tree = case tree of
  <1> v -> True ;
  <2> t1 t2 -> False

A case expression consists of a scrutinee (tree in the example) followed by one or more alternatives separated by a semicolon. Alternatives take the form

<t> x1 ... xn -> e

where t is the tag of the constructor and x1 ... xn are the arguments to the constructor. The number of arguments must match the arity of the constructor. If the tag of the constructor on the scrutinee matches the tag on the alternative, the whole case expression evaluates to e. Every constructor must have a matching alternative - the behaviour when a scrutinee’s tag has no corresponding alternative is not defined (our program will probably just crash).

Built-in primitives

Core has built in support for integers, and a small set of arithmetic and boolean operations defined as binary functions. The representation of Booleans themselves will vary between implementations - often we will map them to the integers 0 and 1.

Operation	Symbol
Addition	`+`
Subtraction	`-`
Multiplication	`*`
Division	`/`
Equality	`==`
Greater than	`>`
Greater than or equal to	`>=`
Less than	`<`
Less than or equal to	`<=`
And	`&`
Or	`\|`

Integer negation is performed via the primitive function negate.

Lambda abstractions

Notably absent from Core is lambda abstraction, or anonymous functions. Functions are always supercombinators. This is an explicit choice to simplify the implementation, and we do not lose any expressive power as a result. Lambda abstractions can be mechanically removed from a program through a process known as lambda lifting. We will cover this in more detail in later section.

Modelling the language

Each implementation of the compiler will take as input a representation of a Core program, so we need to define what that is. Everything else will be built around this type.

data Expr a
  = EVar Name            -- variables
  | ENum Int             -- numbers
  | EConstr              -- constructor
            Int          --   tag
            Int          --   arity
  | EAp (Expr a)         -- applications
        (Expr a)
  | ELet                 -- let(rec) expressions
         Recursive       --   recursive (letrec) or nonrecursive (let)
         [(a, Expr a)]   --   definitions
         (Expr a)        --   body
  | ECase                -- case expressions
          (Expr a)       --   expression to scrutinise
          [Alter a]      --   alternatives
  deriving (Show)

data Recursive = Recursive | NonRecursive
  deriving (Show)

type Name = Text

Expr describes an expression in the Core language. It is parameterised over the type of its binders - a binder is a name given to a variable on the left hand side of a let(rec) expression or function definition. This will allow us to model variable binding more sophisticatedly in later parts without changing the definition of Expr. For now we can make do with simple binders, so we define a type synonym for convenience.

type CoreExpr = Expr Name

Function application is modelled by the EAp constructor. Applications of more than one argument are transformed into nested EAp nodes.

f 1 2
-- becomes
EAp (EAp (EVar "f") (ENum 1)) (ENum 2)

Case expressions consist of a scrutinee and a list of alternatives, modelled by Alter, which is a tuple of the constructor tag, the constructor arguments and the result expression.

type Alter a = (Int, [a], Expr a)
--               0    1    2

--            case s of
--              [0] [1]   [2]
--              <1> x y -> e

A supercombinator is a function with no free variables - in Core all global definitions are supercombinators.

type ScDefn a = (Name, [a], Expr a)
type CoreScDefn = ScDefn Name

A Core program is a collection of supercombinator definitions, one of which is called “main”.

type Program a = [ScDefn a]
type CoreProgram = Program Name

We can now define a complete Core program, as an example.

-- main = double 21 ;
-- double x = x + x
exampleProgram :: CoreProgram
exampleProgram
  = [
      ("main", [], EAp (EVar "double") (ENum 21))
    , ("double", ["x"], EAp (EAp (EVar "+") (EVar "x")) (EVar "x"))
    ]

Parsing and Pretty Printing

We want to be able to parse and pretty print Core, but this is not a parsing or pretty printing tutorial. Instead of going through this in detail, we will rely on the Parse and Print modules. Each of these are defined in literate Haskell files and you can read them if you want to. However I’d recommend seeking out a proper introduction to these topics if you’re not familiar with them. If I come across or recall any good ones I may link to them here for reference. The book also has an extensive section on each of these topics, though the style is somewhat outdated.

The higher-level source language might have a single unified let construct, with an analysis phase to determine which cases must be converted to letrec in Core.↩︎
Except for providing useful error messages. Since you’d have to build this support into the compiler for your higher level language, we don’t consider it here.↩︎
Compiling more expressive pattern matching into case expressions is a well studied problem and there are established algorithms for doing so. For more information on this, see Chapter 5 of The Implementation of Functional Programming Languages.↩︎