Parser Combinators in Ruby
Parser combinators are a technique common in FP languages for writing parsers - programs that convert textual input into a data structure native to the language. A ubiquitous example is a JSON parser, which almost every modern language has. Parser combinators are small building blocks that you compose together to form larger parsers. They’re nice for a number of reasons, but one is that the parser you end up with is extremely readable - often it looks very close to the formal grammar of the language you’re parsing. This makes it easy to spot bugs and to extend the parser in the future.
I won’t give a full introduction to parser combinators here. If you’re not familiar with them, there are a lot of tutorials floating about the internet. I particularly recommend Monadic Parser Combinators by Graham Hutton and Erik Meijer. If you’ve written parsers in the past and found yourself in a bit of a mess of regular expressions and string munging, then you might consider parser combinators as an alternative. In this post I’m going to walk through the creation of a simple parser combinator library in Ruby and use it to build a parser for JSON. The library and JSON parser will each be about 100 LOC.
As an example of what we’re aiming for, here’s a parser written in the combinator style in Haskell:
```haskell
parseNumberList = brackets (sepBy comma number)

brackets p = between (string "[") (string "]") p

sepBy sep p = do
  first <- p
  rest <- many (sep *> p)
  pure (first : rest)

comma = string ","

between open close inner = do
  open
  val <- inner
  close
  pure val

number = ...
```
The high level parser parseNumberList
is constructed from small building blocks,
including several parsers which modify other parsers: these are called combinators.
brackets
parses an open bracket, the given inner parser, and a closing bracket.
between
is a generic version of brackets
which runs the three parsers provided in the
same order. Because these parsers are generic we can reuse them in many places. For
example, if we wanted a parser for a two-element tuple like (1,2)
we might write:
```haskell
tuple = parens do
  fst <- number
  comma
  snd <- number
  pure (fst, snd)

parens p = between (string "(") (string ")") p
```
The end result tends to be a very small amount of code and a very clear description of the language.
To parse in the combinator style you need two things: first class functions and custom flow control. In Ruby we’ll model these respectively using Procs and exceptions.
I’ll briefly describe Procs and exceptions - feel free to skip this part if you’re comfortable with them.
Procs
Ruby’s equivalent of a first class function is the Proc (or lambda):
```ruby
add_one = proc { |x| x + 1 }

> add_one.call(1)
=> 2

id = proc { |x| x }

> id.call("ruby")
=> "ruby"
```
proc
takes as argument a block, which is the body of the function. Any parameters in the
block become formal parameters to the constructed Proc. Procs have access to any variables
in scope when they are constructed, meaning that they form closures.
```ruby
a = 1
inc_a = proc { a += 1 }

inc_a.call
inc_a.call

> a
=> 3
```
Exceptions
When we write two sequential statements in a Ruby method, they will be executed in the usual imperative order.
```ruby
def a
  puts "a"
end

def b
  puts "b"
end

def foo
  a
  b
end

> foo
a
b
=> nil
```
a
has no way to control whether b
is executed after it. However if a
raises
an exception then this changes: control will jump out of the method and any enclosing
methods until we reach a matching rescue
statement. We can abuse this feature to control
how our parsers behave. Specifically, if a parser is made of several smaller parsers
combined together and one of the smaller parsers fails, we want the entire parser to fail.
We want to be able to override this when necessary, but this should be the default.
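As a minimal sketch of this default, here’s the mechanism in isolation (`ParseFailure` here is a standalone illustrative class, not yet the one our library will define):

```ruby
# A custom error class standing in for a parse failure.
class ParseFailure < StandardError; end

steps_run = []

begin
  steps_run << :first
  raise ParseFailure, "first parser failed"
  steps_run << :second # never reached: the raise jumps straight to rescue
rescue ParseFailure
  steps_run << :rescued
end

steps_run # => [:first, :rescued]
```

Because the second step is simply skipped, a sequence of parsers needs no explicit success checks between steps.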
The Core
Armed with these language features we can start to construct our parser. We’ll assume that
the input is always a string, and we’ll track our progress through it with an index loc
(lazy shorthand for “location”).
```ruby
class Parser
  def initialize(input)
    @input = input
    @loc = 0
  end

  # ...
end
```
The intuition is that our parsing methods will “consume” portions of the input, advancing the index as they do. Subsequent parses will only see the remaining input.
```ruby
def input
  @input.byteslice(@loc..-1)
end

def consume(num)
  @loc += num
end
```
If a parse fails, we want to reset the index back to where it was before we started that parse. This behaviour is called backtracking. It isn’t the only option in parser design, but we use it here because I think it makes for clearer parsers.
To provide this behaviour we define a method backtrack
, which takes a block. If that
block fails to parse, backtrack
will reset the index after it. How do we know if a parse
has failed? We’ll use an exception. This has the nice property that if the parser consists
of multiple parts and the first part fails, the subsequent parts will be skipped. Each
parser doesn’t need to check if the previous parser succeeded, because if it had failed it
would have raised an exception and we would never reach the second parser. This is what
allows us to write parsers cleanly (as you’ll see in a minute) without any plumbing of
checking return values between one and the next.
```ruby
def backtrack
  loc_before = @loc
  yield
rescue ParseFailure
  @loc = loc_before
  # re-raise the exception
  # to propagate the failure
  raise
end
```
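Here’s a self-contained sketch of the rollback in action, with the parsing stripped away and only the index bookkeeping kept (`Cursor` is an illustrative name; in the real library this logic lives in `Parser`):

```ruby
class ParseFailure < StandardError; end

# Only the index-tracking parts of the parser.
class Cursor
  attr_reader :loc

  def initialize
    @loc = 0
  end

  def consume(num)
    @loc += num
  end

  def backtrack
    loc_before = @loc
    yield
  rescue ParseFailure
    @loc = loc_before
    raise
  end
end

cursor = Cursor.new
begin
  cursor.backtrack do
    cursor.consume(5)  # make some progress...
    raise ParseFailure # ...then fail partway through
  end
rescue ParseFailure
  # the failure still propagates to us here...
end
cursor.loc # => 0, ...but the index was rolled back
```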
These three methods form the core of our parser: everything else can be constructed using
them. First up, let’s define the most permissive parser: take
. take(n)
will parse n
characters from the input. It will only fail if there are fewer than n
characters left.
```ruby
def take(len)
  match = input.byteslice(0, len)
  if match.length < len
    raise ParseFailure, "expected #{len} characters, but saw #{match.length}"
  end
  consume(len)
  match
end
```
We first fetch len
characters, raising ParseFailure
if we don’t get enough characters.
We then consume
the characters we’ve just fetched, and return them. This is how it
behaves:
```ruby
p = Parser.new("what a great example")

p.take(100)
=> Parser::ParseFailure:
   expected 100 characters, but saw 20

p.take(4)
=> "what"

p.take(2)
=> " a"
```
Next up: string literals.
```ruby
def string(pat)
  backtrack do
    s = take(pat.length)
    fail(pat, s) unless s == pat
    s
  end
end

def fail(expected, actual)
  raise ParseFailure, "expected #{expected.inspect} but saw #{actual.inspect} (#{@loc})"
end
```
string("abc")
will succeed if the input starts with the string abc
. To implement it,
we take
enough characters to match the length of the given string, and then check if
they match. If they don’t, we raise ParseFailure
(via a convenience method fail
). We
wrap all of this in backtrack
to ensure that when we do fail, we roll back the index.
Finally, we’ll want a way to repeatedly parse characters matching some predicate. This is take_while[1]:

```ruby
def take_while(pred)
  input.chars.take_while(&pred).join.tap { |match| consume(match.length) }
end
```
We can now write simple parsers, such as this:
```ruby
class PurchaseOrder < Parser
  def run
    string "BUY: "
    take_while proc { |c| c != "\n" }
  end
end

> PurchaseOrder.new("BUY: eggs").run
=> "eggs"
```
Combinators
To complete our parser, we need ways to compose these primitives together. This is where
we’ll meet our combinators. The first is either
:
```ruby
def either(parser1, parser2)
  backtrack { parser1.call }
rescue ParseFailure
  parser2.call
end
```
either
takes two parsers as arguments. It tries the first, and if that fails it
backtracks and tries the second.
```ruby
run = proc do |input|
  p = Parser.new(input)
  p.either(
    proc { p.string("foo") },
    proc { p.string("bar") },
  )
end

run.("foo")
=> "foo"

run.("bar")
=> "bar"

run.("cat")
Parser::ParseFailure:
  expected "bar" but saw "cat" (3)
```
Next we have zero_or_more
:
```ruby
def zero_or_more(parser)
  matches = []
  loop { matches << backtrack { parser.call } }
rescue ParseFailure
  matches
end
```
zero_or_more
takes a parser and tries to apply it as many times as possible, returning
an array of results. It always succeeds, as the parser can match zero times. This is
analogous to the *
regex operator.
```ruby
run = proc do |input|
  p = Parser.new(input)
  p.zero_or_more(proc { p.string("a") })
end

run.("a")
=> ["a"]

run.("aaaabb")
=> ["a", "a", "a", "a"]

run.("baaa")
=> []
```
If we want the behaviour of regex +
, we can use at_least_one
:
```ruby
def at_least_one(parser)
  first = parser.call
  rest = zero_or_more(parser)
  [first, *rest]
end
```
at_least_one
is identical to zero_or_more
except that the parser must succeed at least
once.
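To see the difference without spinning up a full parser, here’s a self-contained sketch using fake parsers; the `backtrack` stand-in is a no-op because these fakes don’t track an input index:

```ruby
class ParseFailure < StandardError; end

# No-op stand-in: the real backtrack also rolls back the index.
def backtrack
  yield
end

def zero_or_more(parser)
  matches = []
  loop { matches << backtrack { parser.call } }
rescue ParseFailure
  matches
end

def at_least_one(parser)
  first = parser.call
  rest = zero_or_more(parser)
  [first, *rest]
end

# A fake parser that succeeds three times, then fails.
remaining = 3
three_as = proc do
  raise ParseFailure if remaining.zero?
  remaining -= 1
  "a"
end

at_least_one(three_as) # => ["a", "a", "a"]

# Where zero_or_more returns [] on no matches, at_least_one fails.
always_fail = proc { raise ParseFailure }
zero_or_more(always_fail) # => []
begin
  at_least_one(always_fail)
rescue ParseFailure
  # the first call must succeed, so the failure propagates
end
```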
The last combinator we’ll look at is sep_by
, which is the same as sepBy
in the Haskell
example from the start. Given a separator and a parser, it will try to parse repeated
instances of the parser interspersed with the separator (which is itself a parser).
```ruby
def sep_by(separator, parser)
  first = begin
    backtrack { parser.call }
  rescue ParseFailure
    return []
  end

  combined = proc do
    separator.call
    parser.call
  end

  rest = zero_or_more combined
  [first, *rest]
end
```
```ruby
run = proc do |input|
  p = Parser.new(input)
  p.sep_by(
    proc { p.string(",") },
    proc { p.take(1) },
  )
end

run.("")
=> []

run.("1,2,3")
=> ["1", "2", "3"]

run.("1,2,3,45")
=> ["1", "2", "3", "4"]
```
That’s basically it! There are a couple more combinators we could define, like optional, one_of and between, but they are straightforward. Let’s look instead at writing a real world parser with this tooling.
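The post doesn’t define those three, but since the JSON parser below leans on all of them, here’s one plausible sketch - the bodies are my guesses in the spirit of the library, not the original code, and the `backtrack` stand-in is a no-op purely so the snippet runs on its own:

```ruby
class ParseFailure < StandardError; end

def backtrack
  yield # the real version also rolls back the index
end

# optional: try the parser, returning nil instead of failing.
def optional(parser)
  backtrack { parser.call }
rescue ParseFailure
  nil
end

# one_of: a variadic either, trying each parser in turn.
def one_of(parsers)
  parsers.each do |parser|
    return backtrack { parser.call }
  rescue ParseFailure
    next
  end
  raise ParseFailure, "expected one of: #{parsers}"
end

# between: run open, inner and close, returning inner's result.
def between(open, close, inner)
  backtrack do
    open.call
    val = inner.call
    close.call
    val
  end
end

optional(proc { raise ParseFailure })                    # => nil
one_of([proc { raise ParseFailure }, proc { "second" }]) # => "second"
between(proc { "(" }, proc { ")" }, proc { "inner" })    # => "inner"
```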
Parsing JSON
We’re going to write a mostly-compliant JSON parser. It won’t handle unicode and its
floating point behaviour will be a bit broken, but it will otherwise work correctly. I’ve
tested it against the fixtures in the JSON Parsing Test Suite
and it passes 109 of the 141 y_
tests (samples that must be accepted by the parser). In
total, the whole parser is 119 lines long.
To start, we define our class and entrypoint:

```ruby
class JsonParser < Parser
  def run
    json_value
  end
end
```
json_value
parses, well, any JSON value.
```ruby
def json_value
  one_of [
    method(:object),
    method(:array),
    method(:quoted_string),
    method(:boolean),
    method(:null),
    method(:number),
  ]
end
```
one_of
is a combinator that acts like a variadic either
: it will try each parser in
turn until one succeeds. Each of the parsers given to it is a method on our class, so we
convert them to first class objects using
Object#method
. This will
allow us to #call
them just like Procs. We’ll walk through each of these methods in
turn.
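Object#method isn’t parsing-specific; as a quick illustration (Greeter is a made-up class for the example):

```ruby
class Greeter
  def hello(name)
    "hello, #{name}"
  end
end

# method(:hello) wraps the named method in a Method object,
# which responds to #call just like a Proc does.
m = Greeter.new.method(:hello)
m.call("ruby") # => "hello, ruby"
```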
```ruby
def object
  inner = proc do
    kvs = sep_by method(:comma), method(:key_value_pair)
    Hash[kvs]
  end

  between proc { token "{" },
          proc { token "}" },
          inner
end

def token(str)
  string(str).tap { skip_spaces }
end

def skip_spaces
  take_while(proc { |c| [" ", "\n"].include?(c) })
end
```
object
uses between
to parse the opening and closing brackets. Inside the brackets we
parse a series of key-value pairs. We use sep_by
with a separator of comma
(which does
what you’d expect), and an inner parser of key_value_pair
. This gives us a nested array
which we convert to a Hash before returning. To parse the brackets we use token
, which
is a wrapper around string
which consumes any trailing whitespace. We’ll use
it throughout the parser - it allows us to largely ignore whitespace and keep
things concise.
```ruby
def key_value_pair
  key = quoted_string
  token ":"
  value = json_value
  [key, value]
end

def quoted_string
  str = between proc { string "\"" },
                proc { token "\"" },
                proc { take_while(proc { |c| c != "\"" }) }
  str
end
```
key_value_pair
is a string enclosed in quotes, followed by a colon, followed by any
JSON value. We naïvely assume that a string is any sequence of characters excluding the
JSON value. We naïvely assume that a string is any sequence of characters excluding the
double quote character. This obviously doesn’t cater for escape characters, but we’ll
conveniently ignore that. Note that we can’t use token
for the opening quote in
quoted_string
because any trailing whitespace after that character forms part of the
string we’re trying to parse.
That’s JSON objects, then. Next up, arrays:
```ruby
def array
  res = between proc { token "[" },
                proc { token "]" },
                proc { sep_by method(:comma), method(:json_value) }
  res
end
```
We again use between
to handle the enclosing brackets, and inside them we just parse a
series of JSON values separated by commas.
Booleans and nulls are very simple:

```ruby
def boolean
  bool = either proc { token "true" }, proc { token "false" }
  bool == "true"
end

def null
  token "null"
  nil
end
```
And all that’s left is numbers. JSON has one number type, which is the float. Parsing floats is a bit complex as we have to deal with positive/negative signs, decimals and exponents, but we can cater for all of that without a huge amount of code. It is, however, a little less clear than what we’ve covered so far. I’ll leave it as an exercise to work out what we’re doing here - hopefully it’s not too hard!
```ruby
def number
  sign = optional method(:sign)
  result = integer
  decimals = optional(proc { string "."; integer })
  exponent = optional(proc do
    either(proc { string "e" }, proc { string "E" })
    signed_integer
  end)

  result = result.to_f if decimals || exponent
  result += (decimals.to_f / (10**decimals.to_s.length)) if decimals
  result *= 10**exponent.to_f if exponent
  sign == "-" ? 0 - result : result
end

def signed_integer
  sign = optional method(:sign)
  n = integer
  sign == "-" ? 0 - n : n
end

INTEGERS = (0..9).map(&:to_s)

def integer
  int = proc do
    char = take 1
    fail("one of #{INTEGERS}", char) unless INTEGERS.include?(char)
    char
  end

  numstr = at_least_one int
  skip_spaces
  numstr.join.to_i
end

def sign
  either proc { token "-" },
         proc { token "+" }
end
```
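As a worked example of the arithmetic in number, here’s how the pieces combine for the input -1.23e3, with the values each sub-parser would produce hard-coded:

```ruby
sign     = "-" # from optional method(:sign)
result   = 1   # from integer
decimals = 23  # the digits after the "."
exponent = 3   # from signed_integer, after the "e"

result = result.to_f                                 # 1.0
result += decimals.to_f / (10**decimals.to_s.length) # 1.0 + 23/100  = 1.23
result *= 10**exponent.to_f                          # 1.23 * 1000.0 = 1230.0
result = sign == "-" ? 0 - result : result           # => -1230.0
```

(This stepwise construction is likely the source of the slightly broken floating-point behaviour mentioned earlier.)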
That’s the whole thing. Let’s try it out on some JSON.
```ruby
JsonParser.new("{}").run
=> {}

JsonParser.new("[]").run
=> []

JsonParser.new("4").run
=> 4

JsonParser.new(<<EOF).run
{
  "a": "sample",
  "json" : "object"
}
EOF
=> {"a"=>"sample", "json"=>"object"}

JsonParser.new(<<EOF).run
{
  "a": "sample",
  "json": "object",
  "with": [
    "an",
    "array",
    -1.23e3,
    { "two": "three" }
  ]
}
EOF
=> {
  "a" => "sample",
  "json" => "object",
  "with" => [
    "an",
    "array",
    -1230.0,
    { "two" => "three" }
  ]
}

JsonParser.new("bad input").run
=> Parser::ParseFailure: expected one of: [
  #<Method: JsonParser#object>,
  #<Method: JsonParser#array>,
  #<Method: JsonParser#quoted_string>,
  #<Method: JsonParser#boolean>,
  #<Method: JsonParser#null>,
  #<Method: JsonParser#number>
]
from lib/parser.rb:60:in `one_of'
```
And that’s it! A (mostly) complete JSON parser in around 100 LOC. Hopefully that gives you an idea of how you can take these parsing techniques and apply them outside of functional programming languages.
Performance
There’s one final thing to address, and that is performance. Ruby isn’t known for its speed, and for tasks like this it often delegates to an underlying C library to do the heavy lifting - this is the case for most JSON, YAML and XML parsers. The parser we’ve written here is, comparatively, extremely slow: parsing a 100KB JSON document on my laptop takes 3.5 seconds compared to ~0.15 seconds using the Ruby standard library JSON parser[2]. I’m not yet sure whether this is due to slow string indexing, repeated backtracking, overuse of exceptions, or something else - it would be quite interesting to find out. Until then, just keep in mind that whilst these techniques do translate, the tradeoffs might be very different between languages!
You can find all the code for this post here.
[1] A more idiomatic definition of take_while would take a block argument rather than a proc, so that you can write take_while { |c| ... }. I’ve not done this here to keep the usage of Procs consistent, but you could certainly do so in the real world.

[2] To be specific, the ext implementation, which is 2000 lines of C, rather than the pure Ruby implementation.