Implementing a Functional Language: Parsing Core
This is an appendix to a series in implementing a functional language. The introduction is here. This is a literate Haskell file - you can download the source here. To load it into GHCi and play around, you can use the following command (you’ll need the source for the
Core.Language module as well):
stack --resolver lts-12.2 \ ghci --package megaparsec \ --package text \ 2019-03-03-parsing-core.lhs \ 2019-03-03-the-core-language.lhs
This module exports the parser for our language (
parser). It can be passed to functions like
Text.Megaparsec.parse to parse input into a
Core.CoreProgram. We also re-export
Text.Megaparsec.parse for convenience.
Our parser is formed by combining together smaller parsers, in typical parser combinator style. In the interest of brevity, we put little effort into useful error messages. Here we define a type for all our parsers, which specifies a basic error type and an input type of
Text. See the Megaparsec documentation for a description of the
Lexing is the process of converting a stream of characters to a stream of tokens. It’s typically used to strip out whitespace, comments, and other parts of the input that have no bearing on the parse result. To avoid having to process the input twice, we intersperse lexing with parsing by wrapping all primitive parsers in a space consumer which is responsible for stripping any trailing “space”. Our space consumer will strip whitespace and any lines beginning with
num are our two primitive parsers - we’ll build everything else out of them.
symbol parses a specific string, and
num parses a number.
We define some helpers that we’ll rely on later on.
between is a parser combinator which wraps a parser in a “left” and “right” parser which will parse either side of the input.
A Core identifier (i.e. a variable name, supercombinator name or let binding name) consists of alphanumeric characters. The first character can’t be a number.
A Core program is a collection of supercombinator definitions.
A supercombinator is a name, an optional list of arguments, and a expression representing the function body.
A variable name is either an
identifier or one of the built-in operators. It can’t be a reserved keyword.
An expression is either a let(rec), a case, an application, a constructor, a numeric literal or a variable. To avoid the problem of left recursion we handle applications separately from simple expressions. If we didn’t do this, parsing
f x would recurse infinitely on
let_ :: Parser CoreExpr let_ = do isRec <- letrec defns <- letDefn `sepBy1` semicolon symbol "in" e <- expr pure $ ELet isRec defns e where letrec = (symbol "letrec" >> pure Recursive) <|> (symbol "let" >> pure NonRecursive) letDefn = do v <- var symbol "=" e <- expr pure (v, e) case_ :: Parser CoreExpr case_ = do symbol "case" scrutinee <- expr symbol "of" alts <- caseAlternatives pure $ ECase scrutinee alts
This is all there is to the parser. Like the printer, it is short and sweet.