March 8, 2019

Implementing a Functional Language: Parsing Core

This is an appendix to a series in implementing a functional language. The introduction is here. This is a literate Haskell file - you can download the source here. To load it into GHCi and play around, you can use the following command (you’ll need the source for the Core.Language module as well):

stack --resolver lts-12.2         \
      ghci --package megaparsec   \
           --package text         \
      2019-03-03-parsing-core.lhs \
      2019-03-03-the-core-language.lhs

This module exports the parser for our language (parser). It can be passed to functions like Text.Megaparsec.parse to parse input into a Core.CoreProgram. We also re-export Text.Megaparsec.parse for convenience.

Our parser is formed by combining together smaller parsers, in typical parser combinator style. In the interest of brevity, we put little effort into useful error messages. Here we define a type for all our parsers, which specifies a basic error type and an input type of Text. See the Megaparsec documentation for a description of the Parsec and ErrorFancy types.

Lexing

Lexing is the process of converting a stream of characters to a stream of tokens. It’s typically used to strip out whitespace, comments, and other parts of the input that have no bearing on the parse result. To avoid having to process the input twice, we intersperse lexing with parsing by wrapping all primitive parsers in a space consumer which is responsible for stripping any trailing “space”. Our space consumer will strip whitespace and any lines beginning with --.

Parsing Primitives

symbol and num are our two primitive parsers - we’ll build everything else out of them. symbol parses a specific string, and num parses a number.

We define some helpers that we’ll rely on later on. between is a parser combinator which wraps a parser in a “left” and “right” parser which will parse either side of the input.

A Core identifier (i.e. a variable name, supercombinator name or let binding name) consists of alphanumeric characters. The first character can’t be a number.

A Core program is a collection of supercombinator definitions.

A supercombinator is a name, an optional list of arguments, and a expression representing the function body.

A variable name is either an identifier or one of the built-in operators. It can’t be a reserved keyword.

An expression is either a let(rec), a case, an application, a constructor, a numeric literal or a variable. To avoid the problem of left recursion we handle applications separately from simple expressions. If we didn’t do this, parsing f x would recurse infinitely on f.

This is all there is to the parser. Like the printer, it is short and sweet.