Implementing a Functional Language: Parsing Core
This is an appendix to a series on implementing a functional language. The introduction is here. This is a literate Haskell file - you can download the source here. To load it into GHCi and play around, you can use the following command (you’ll need the source for the Core.Language module as well):
stack --resolver lts-12.2 \
ghci --package megaparsec \
--package text \
2019-03-03-parsing-core.lhs \
2019-03-03-the-core-language.lhs
This module exports the parser for our language (parser). It can be passed to functions like Text.Megaparsec.parse to parse input into a Core.CoreProgram. We also re-export Text.Megaparsec.parse for convenience.
{-# LANGUAGE OverloadedStrings #-}
module Core.Parse (parser, parse) where
import Core.Language
import Text.Megaparsec
import qualified Text.Megaparsec.Char.Lexer as L
import Text.Megaparsec.Char
import Data.Text (Text)
import qualified Data.Text as T
import Data.Void
import Data.Proxy
Our parser is formed by combining together smaller parsers, in typical parser combinator style. In the interest of brevity, we put little effort into useful error messages. Here we define a type for all our parsers, which specifies a basic error type and an input type of Text. See the Megaparsec documentation for a description of the Parsec and ErrorFancy types.
type Parser = Parsec (ErrorFancy Void) Text
Lexing
Lexing is the process of converting a stream of characters to a stream of tokens. It’s
typically used to strip out whitespace, comments, and other parts of the input that have
no bearing on the parse result. To avoid having to process the input twice, we intersperse
lexing with parsing by wrapping all primitive parsers in a space consumer which is
responsible for stripping any trailing “space”. Our space consumer will strip whitespace
and any lines beginning with --.
spaceConsumer :: Parser ()
spaceConsumer = L.space space1 (L.skipLineComment "--") empty
lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceConsumer
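For example, a parser wrapped in lexeme also consumes any whitespace and comments that follow the token, so the next parser starts at the next piece of real input. An illustrative GHCi session (string and eof are stock Megaparsec combinators, used here purely for demonstration):

>>> parse (lexeme (string "foo") <* eof) "" "foo   -- trailing comment"
Right "foo"

Without the lexeme wrapper, the trailing spaces and comment would be left unconsumed and eof would fail on the same input.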
Parsing Primitives
symbol and num are our two primitive parsers - we’ll build everything else out of them. symbol parses a specific string, and num parses a number.
symbol :: Text -> Parser Text
symbol = L.symbol spaceConsumer
num :: Parser Int
num = lexeme L.decimal
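For instance (illustrative results):

>>> parse (symbol "let") "" "let   "
Right "let"
>>> parse num "" "42 -- the answer"
Right 42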
We define some helpers that we’ll rely on later on. between is a parser combinator which wraps a parser in a “left” and “right” parser which will parse either side of the input.
semicolon :: Parser Text
semicolon = symbol ";"
parens :: Parser a -> Parser a
parens = between (symbol "(") (symbol ")")
angles :: Parser a -> Parser a
angles = between (symbol "<") (symbol ">")
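For example (illustrative results):

>>> parse (parens num) "" "( 7 )"
Right 7
>>> parse (angles num) "" "<2>"
Right 2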
A Core identifier (i.e. a variable name, supercombinator name or let binding name) consists of alphanumeric characters. The first character can’t be a number.
identifier :: Parser Text
identifier = lexeme $ do
  first <- letterChar
  rest <- many alphaNumChar
  pure $ tokensToChunk (Proxy :: Proxy Text) (first : rest)
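Illustratively, fac3 is a valid identifier, while 3fac is rejected because it starts with a digit:

>>> parse identifier "" "fac3"
Right "fac3"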
A Core program is a collection of supercombinator definitions.
parser :: Parser CoreProgram
parser = supercombinator `sepBy1` semicolon
A supercombinator is a name, an optional list of arguments, and an expression representing the function body.
supercombinator :: Parser CoreScDefn
supercombinator = do
  name <- var
  args <- many var
  symbol "="
  body <- expr
  pure (name, args, body)
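For example, the definition id x = x parses to a triple of name, argument list and body (illustrative output, assuming Show instances for the Core.Language types):

>>> parse supercombinator "" "id x = x"
Right ("id",["x"],EVar "x")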
A variable name is either an identifier or one of the built-in operators. It can’t be a reserved keyword.
var :: Parser Name
var = choice $ operators ++ [try alphaVar]
  where alphaVar = do
          v <- identifier
          if v `elem` keywords
            then
              fail . T.unpack $ "cannot use keyword " <> v <> " as variable"
            else
              pure v
operators :: [Parser Text]
operators = map symbol ["==", "!=", ">=", "<=", "+", "-", "*", "/", ">", "<"]
keywords :: [Text]
keywords = ["let", "letrec", "case", "in", "of", "Pack", "->"]
An expression is either a let(rec), a case, an application, a constructor, a numeric literal or a variable. To avoid the problem of left recursion we handle applications separately from simple expressions. If we didn’t do this, parsing f x would recurse infinitely on f.
expr :: Parser CoreExpr
expr = let_ <|> case_ <|> application <|> aexpr
aexpr :: Parser CoreExpr
aexpr = parens expr <|> constructor <|> fmap ENum num <|> fmap EVar var
application :: Parser CoreExpr
application = foldl1 EAp <$> some aexpr
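Because foldl1 EAp folds from the left, application is left-associative: f x y means (f x) y. Illustratively:

>>> parse application "" "f x y"
Right (EAp (EAp (EVar "f") (EVar "x")) (EVar "y"))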
let_ :: Parser CoreExpr
let_ = do
  isRec <- letrec
  defns <- letDefn `sepBy1` semicolon
  symbol "in"
  e <- expr
  pure $ ELet isRec defns e
  where letrec  = (symbol "letrec" >> pure Recursive) <|> (symbol "let" >> pure NonRecursive)
        letDefn = do
          v <- var
          symbol "="
          e <- expr
          pure (v, e)
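An illustrative parse of a non-recursive let with two definitions (assuming the ELet and IsRec constructors from Core.Language):

>>> parse let_ "" "let x = 1; y = 2 in f y x"
Right (ELet NonRecursive [("x",ENum 1),("y",ENum 2)] (EAp (EAp (EVar "f") (EVar "y")) (EVar "x")))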
case_ :: Parser CoreExpr
case_ = do
  symbol "case"
  scrutinee <- expr
  symbol "of"
  alts <- caseAlternatives
  pure $ ECase scrutinee alts
caseAlternatives :: Parser [Alter Name]
caseAlternatives = alt `sepBy1` semicolon
  where alt = do
          tag <- angles num
          args <- many var
          symbol "->"
          e <- expr
          pure (tag, args, e)
constructor :: Parser CoreExpr
constructor = do
  symbol "Pack{"
  tag <- num
  symbol ","
  arity <- num
  symbol "}" -- consume the closing brace of the Pack{..} literal
  pure $ EConstr tag arity
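Putting it all together, here is a small program and the value it should parse to (illustrative, with the output laid out for readability):

>>> parse parser "<example>" "main = twice id 3 ; twice f x = f (f x) ; id x = x"
Right [ ("main",  [],        EAp (EAp (EVar "twice") (EVar "id")) (ENum 3))
      , ("twice", ["f","x"], EAp (EVar "f") (EAp (EVar "f") (EVar "x")))
      , ("id",    ["x"],     EVar "x") ]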
This is all there is to the parser. Like the printer, it is short and sweet.