Building a tiny little broken calculator with parser combinators

Jul 25, 2022

Perhaps I have a string lying around somewhere...

const testString = "Abcd123";

What a lovely little string.

I will start with a function that takes a string as input and tells you if the input passes some test you've set out for it. This is not yet a parser, not really. Usually, a parser does quite a lot more than simply return a boolean, but at its most basic, it must make this binary distinction between

  1. Input that matches
  2. Input that does not match

So conceptually, this is a good place to start.

const aParser = input => input === "Abcd123"
aParser(testString); // true

const anotherParser = input => input === "something else"
anotherParser(testString); // false

That's not very interesting. Maybe a more interesting question?

const aMoreInterestingParser = input => input[0] === "A"
aMoreInterestingParser(testString); // true
aMoreInterestingParser("Aadjfiojda"); // true

But what now? What do we do with this return value? It would be more useful if the return value included something else that can be acted upon, like the remaining portion of the input...

I can do this quite simply as a little informal tuple:

const aBetterAndMoreInterestingParser = input =>
  [
    input[0] === "A",
    input.slice(1, input.length)
  ]
aBetterAndMoreInterestingParser("Aadjfiojda"); // [ true, 'adjfiojda' ]
aBetterAndMoreInterestingParser("Zadjfiojda"); // [ false, 'adjfiojda' ]

But wait, what if the parser fails? We don't want to keep parsing the input from there, we want to try the same place again, don't we? Then the "remaining input" should include the current character in this case:

const aBetterAndMoreInterestingParser = input => {
  if (input[0] === "A") {
    return [true, input.slice(1, input.length)]
  } else {
    return [false, input]
  }
}
aBetterAndMoreInterestingParser("Aadjfiojda"); // [ true, 'adjfiojda' ]
aBetterAndMoreInterestingParser("Zadjfiojda"); // [ false, 'Zadjfiojda' ]

Parser Generators

A parser generator is a function that returns a parser. Maybe you give it some input and it decides what to match based on that...

const parseChar = (char) =>
  (input) => {
    if (input[0] === char) {
      return [true, input.slice(1, input.length)];
    } else {
      return [false, input];
    }
  };

const parseA = parseChar('A');
const parseB = parseChar('B');
parseA(testString) // [ true, 'bcd123' ]
parseB(testString) // [ false, 'Abcd123' ]

Wait wait, I want to back up a bit. I'm describing all these things in plain english, but I can describe them more succinctly and completely using a type system. Perhaps I have one of those lying around somewhere...

Describing with types

So in this initial (but as of yet incomplete) formulation, a "parser" is a function that takes a string as input and tells you if the input passes some test you've set out for it. For now, let's pretend that input is always a string. It could be binary data or streaming data or some other abstraction over data, but for now, let's stick with strings.

type Parser = (s: string) => boolean;

Actually, speaking of "other abstractions over data", here's a trick... instead of passing the string around over and over again, I will pass around an object that references that string and holds and index denoting where the parsing has progressed to.

type Stream = {
  src: string;
  idx: number;
};
type Parser = (s: Stream) => boolean;

This is pretty similar to how the FILE struct in the C standard library works...

// stdio.h
typedef    struct __sFILE {
    unsigned char *_p;    /* current position in (some) buffer */
 // ...plenty more stuff
} FILE;

Again, these should be composable (much more on that later), so the parser should itself return a stream with its index advanced.

type Stream = {
  src: string;
  idx: number;
};
type Parser = (s: Stream) => Stream;

What happens if the parser does not match the input?

type Stream = {
  src: string;
  idx: number;
};
type Result = Stream | undefined;
type Parser = (s: Stream) => Result;

For the moment, I will return undefined for that case, so a Result is a union type that can either be another Stream or undefined.

Yes I know this is a poor imitation of Maybe. We'll get there, maybe.

So again, a generator is a function that takes some input and returns a Parser. For the moment, let's assume that input is also going to be a string, although for generators, it could really be anything as long as the output is a Parser.

type Generator = (s: string) => Parser;

Ok enough with the types, what do these look like in practice?

const char: Generator = (char) =>
  ({ src, idx }) =>
    char === src?.[idx]
      ? { src, idx: idx + 1 }
      : undefined;

This is a higher order function. A function that returns a function.

This generator takes a string and returns a Parser, which is a function that takes a Stream and returns a Result.

const parseA = char('A')

parseA({
  src: "Abba was a pretty good pop band",
  idx: 0
})

returns:

{
  src: "Abba was a pretty good pop band",
  idx: 1
}

notice that the idx has been advanced, indicating that the parser was successful.

and...

parseA({
  src: "But ELO is really good too",
  idx: 0
})

returns:

undefined

Parser Combinators

A function that takes some number of parsers and combines them somehow into a new parser:

type Combinator = (...parsers: Parser[]) => Parser;

This is where it gets spicy.

Here is a simple combinator: or.

const or: Combinator = function (p1, p2) {
  return function (input) {
    return p1(input) || p2(input);
  };
};

Again, a function returning a function, could be written more tersely as:

const or: Combinator = (p1, p2) => (input) => p1(input) || p2(input);

And so:

const parseA = char('A');
const parseB = char('B');
const parseAorB = or(parseA, parseB);

parseAorB({src: "Applesauce and orange juice", idx: 0}); // matches!
parseAorB({src: "Buckets of rain", idx: 0}); // matches!
parseAorB({src: "Canada oh Canada", idx: 0}); // does not match :(

It's a bit onerous to initialize these Streams everytime, a helper function for that might be:

const s = (s: string): Stream => ({ src: s, idx: 0 });

and is a tad bit trickier because we care about how far the first parser has advanced the index, so we need a reference to it available for the second:

const and: Combinator = (p1, p2) =>
  (input) => {
    const r = p1(input);
    if (r) {
      const r2 = p2(r);
      if (r2) {
        return r2,
      }
    }
  };
parseAandB = and(parseA, parseB);

parseAandB(s("Argyle sweaters")); // does not match :(
parseAandB(s("Beer battered bananas"); // does not match :(
parseAandB(s("ABBA"); // def matches

What is this good for?

What I've got here so far you could imagine has some practical application for something like input validation. Let's say you require a valid input to begin with "ABBA", for example:

parseAandB = and(parseA, parseB);
parseBandA = and(parseB, parseA);
parseABBA = and(parseAandB, parseBandA)

parseABBA(s("ABBAthis is valid input");
parseABBA(s("ABBAthis is not");

But of course, this is contrived and limited, and not at all what we mean when we talk about real parsers, real parsers do more than just validate input, they also return something useful!

Right now, a Result looks like:

type Result = Stream | undefined;

But actually, we don't want a Stream by itself, we also want what the parser matched:

type OutputValue = {
  stream: Stream;
  value: any;
};
type Result = OutputValue | undefined;

This necessitates some changes around what the parsers are returning, so for example, a character parser generator will now look like this:

export const char: Generator = (c) =>
  ({ src, idx }) =>
    c === src?.[idx]
      ? { stream: { src, idx: idx + 1 }, value: src[idx] }
      : undefined;

and the and combinator might now look like this:

export const and: Combinator = (p1, p2) =>
  (input) => {
    const r = p1(input);
    if (r) {
      const r2 = p2(r.stream);
      if (r2) {
        return {
          stream: r2.stream,
          value: [r.value, r2.value].flat(),
        };
      }
    }
  };

In both cases, you can see that I'm returning the matched values in some form.

So now...

parseABBA(s("ABBAthis is valid input");

returns:

{
  stream: { src: "ABBAthis is valid input", idx: 4 },
  value: [ "A", "B", "B", "A" ]
}

This is useful because now we can do stuff with the matched values.

A very simple calculator

I want to start by building a very simple calculator that can add two single digit numbers together. We've already got a char parser generator, so we can pretty easily create all the parsers we need for this:

const zero = char("1");
const one = char("1");
const two = char("2");
const three = char("3");
const four = char("4");
const five = char("5");
const six = char("6");
const seven = char("7");
const eight = char("8");
const nine = char("9");

const digit = or(nine, or(eight, or(seven, or(six, or(five, or(four, or(three, or(two, or(one, zero)))))))));

Terribly difficult to read that, isn't it? I hope the intent is clear, but that's not elegant at all. I need something more streamlined...

const any: Combinator = (...ps) => ps.reduce((p1, p2) => or(p1, p2));

This takes any number of parsers and chains the ors just like I've done long hand above,

const digit = any(zero, one, two, three, four, five, six, seven, eight, nine);

Much clearer, but any also facilitates an easier generator too:

const anyChar: Generator = (str): Parser =>
  (input) => any(...str.split("").map(char))(input);

This function takes a string, splits it into its constituent characters, maps char over them to get parsers, and then reduces or over them via any to produce a parser that will match any character in a string.

const digit = anyChar("0123456789");

very compact.

We'll of course need to match + as well:

const plus = char("+");

And now there is enough for a simple expression parser:

const expression = and(digit, and(plus, digit));

This pattern seems familiar, we can reduce and over arbitrary numbers of parsers as well, to clean up the syntax a bit:

export const andThen: Combinator = (...ps) =>
  ps.reduce((p1, p2) => and(p1, p2));

So becomes:

const expression = andThen(digit, plus, digit);

Does this work, then?

expression(s("1+2"));

It does!

{ stream: { src: "1+2", idx: 3 }, value: [ "1", "+", "2" ] }

This is where things start to get fun.

Each individual parser returns what it matched on, and only what it matched on. For digits, if it matches, then we know it's going to be a digit- that's the whole point of matching on it. But we don't really want a string representation of a digit, do we? We want a real digit.

To get this, we'll create a new higher order function that

  1. takes a parser
  2. takes an arbitrary function
  3. if the parser succeeds, run it's matched value through the function
const map = (parser: Parser, fn: Function) =>
  (input: Stream) => {
    const out = parser(input);

    if (out) {
      out.value = fn(out.value);
      return out;
    }
  };

This is general! A simple change to the digit definition above, then

const digit = map(anyChar("0123456789"), n => parseInt(n));

and:

expression(s("1+2"));

returns

{ stream: { src: "1+2", idx: 3 }, value: [ 1, "+", 2 ] }

Look closely, the digits in the value array are actual javascript digits. This may seem trivial, but I assure you, it's very powerful!

Consider for a moment: All of these parsers are built up from scratch, and the functions to which they are mapping their values to are similarly arbitrary. parseInt is a simple example, but you can do basically anything with the input as it's being parsed. Some more thoughts on this later, but I think that it is this power and flexibility that excites people when they first learn about combinators.

With the constituent parts of expression now being transformed into javascript numbers, one more step will do it...

const calculate = map(expression, (value: any) => {
  if (!value) {
    return;
  }
  const [x, operation, y] = value;

  switch (operation) {
    case "+": {
      return x + y;
    }
  }
});
calculate(s("1+2")) // { stream: { src: "1+2", idx: 3 }, value: 3 }
calculate(s("3+4")) // { stream: { src: "3+4", idx: 3 }, value: 7 }

I think this is really cool!

An expanded expression parser

Let's say we add a few more simple arithmetic operations:

const digit = map(anyChar("0123456789"), n => parseInt(n));
const plus = char("+");
const minus = char("-");
const times = char("*");
const divide = char("/");
const op = any(divide, times, plus, minus);
const expression = andThen(digit, op, digit)

const calculate = map(expression, (value: any) => {
  if (!value) {
    return;
  }
  const [x, operation, y] = value;

  switch (operation) {
    case "+": {
      return x + y;
    }
    case "-": {
      return x - y;
    }
    case "*": {
      return x * y;
    }
    case "/": {
      return x / y;
    }
  }
});

Very compact! And it will do what it looks like it will do:

calculate(s('1*7')); // { stream: { src: "1*7", idx: 3 }, value: 7 }
calculate(s('1/4')); // { stream: { src: "1/4", idx: 3 }, value: 0.25 }
calculate(s('5-2')); // { stream: { src: "5-2", idx: 3 }, value: 3 }

A small confession

So we're all set up for the next section. It's one of the more interesting sections, to be honest, so if you've made it this far, I hope it will be worth it.

My small confession is not the next section, it's a preamble to the next section, but the confession is that I've attempted several times over the last couple of years, when I had a little time, to grok parser combinators, and I just kept failing.

Over and over again, I'd watch the videos linked at the bottom of this post, and think "now I've got it!"

And I wouldn't have got it, and I would try to get them to work and I wouldn't be able to, and I'd give up. They're kind of tricky!

Finally, at this latest attempt, I worked out a few small details that are quite subtle, but make all the difference. I'm going to describe those problems and solutions now.

A slightly bigger calculator

We've got all four basic arithmetic operations available, what about something like this?

calculate(s('1+2+3'));

It's quite obvious that this should resolve to 6, and I have parsers for all the constituent parts, but of course as written, this doesn't work the way we would like it to:

{ stream: { src: "1+2+3", idx: 3 }, value: 3 }

It only manages to parse the first expression in the series of expressions, because that's all we informed it about. An expression is actually a bit more complicated than that.

const expression = or(andThen(expression, op, digit), digit);

Now, the above definition of an expression is correct, strictly speaking, but there are a few things wrong with it, and these things in combination kept me scratching my head for a long while. Let's examine the issues one by one.

The first is helpfully pointed out, and pretty obvious as well:

error: TS2448 [ERROR]: Block-scoped variable 'expression' used before its declaration.
const expression = or(andThen(expression, op, digit), digit);

expression can't use itself in it's own definition, which is on the one hand sensible, but on the other... well recursive grammars do exist, and in fact recursive function calls are used all the time... this is really a language constraint.

But there is an easy solution, and that solution is a thunk!

You wrap an operation in a function, the function is then passed around as a value instead of the operation. This means that we can specify parts of the recursive definition that are not evaluated when the grammar is defined, because the thunk, as this function is called, is not evaluated until it's invoked. It's an elegant way to introduce lazy evaluation to a language that doesn't have it built into the syntax.

function lazyAnd(first: () => Parser, second: () => Parser): Parser {
  return function (input) {
    return and(first(), second())(input);
  };
}

This is not really a combinator, because it doesn't accept Parsers, it accepts functions that take no arguments that return Parsers.

Applied to the definition of expression thusly:

const expression = or(lazyAnd(() => expression, () => and(op, digit)), digit);

And the recursive definition problem is resolved. As far as the runtime is concerned, by the time the parser is actually invoked, expression has been defined.

But now, another showstopper:

error: Uncaught RangeError: Maximum call stack size exceeded
const expression = or(lazyAnd(() => expression, () => and(op, digit)), digit);

Stripping away some of the fancy bits here, it does seem fairly obvious that since expression is the first parser evaluated as a sub part of expression, that it lacks a base case and will recurse forever.

But at the same time, that's certainly a valid definition of an expression, right?

expr
----------------
expr, op, digit
----
(1+2)  +  3

Honestly, this stumped me for a long while. The solution, when I found it, was so simple and obvious in retrospect...

There is another way to define the same expression!

expr
---------------
digit, op, expr
          -----
1      +  (2+3)

This is semantically correct as well, but crucially provides a point where the recursion can be short circuited.

const expression = or(lazyAnd(() => digit, () => and(op, expression)), digit);

With a small tweak to the calculate parser and plugging that into the expression definition, this works for as many terms as desired!

// Note `calculate` instead of `expression` --------------v
const expression = or(lazyAnd(() => digit, () => and(op, calculate)), digit);

const calculate = map(expression, (value: any) => {
  if (!value) {
    return;
  }

  // If an expression can potentially be a bare digit now, `calculate` needs to
  // be able to "evaluate" it by returning it directly.
  if (typeof value === 'number') {
    return value;
  }

  const [x, operation, y] = value;

  switch (operation) {
    case "+": {
      return x + y;
    }
    case "-": {
      return x - y;
    }
    case "*": {
      return x * y;
    }
    case "/": {
      return x / y;
    }
  }
});

and so:

calculate(s("1+2+3+4+5+6+7+8+9")),

returns

{ stream: { src: "1+2+3+4+5+6+7+8+9", idx: 17 }, value: 45 }

It's alive!

Some more problems

This little calculator is quite impressive for just a few lines of parsing code, and shows off how composable and powerful combinators can be, but it is certainly nothing but a little toy. For example, it has no concept of ordering the operations in any intelligible way except from rightmost to leftmost:

calculate(s("2*3+1")),

We know this should be interpreted as (2*3)+1 = 7 and do the multiplication first, but the poor simple parser dutifully executes from the bottom of the stack (the end of the input) to the top (the beginning), and we get 2*(3+1) = 8.

The solution to this issue is that, in practice, parsers don't usually return arbitrary types of values from their inputs, they return some over-arching structure like an abstract syntax tree composed of nodes that carry metadata and thus semantic meaning in context. A real calculator would parse the tree before evaluating it mathematically, and take into account considerations such as order of operations then. This parser as written doesn't keep much state internally except for where it has advances to in the input, and so decisions like that can't be made effectively.

Parsing into an AST is a outside the scope of this post, as it's already getting pretty long, though I hope to learn more about that and write about it! That does seem like the point at which parser combinators become genuinely useful, and not just fun.

I have also elided the monadic nature of the parsing, though they can potentially fit quite nicely into that pattern, although it would be necessary to structure them slightly differently than I have done in this post.

Thanks a LOT to Stefanie Schirmer for helping me work through some of my major difficulties with these concepts, And my batchmates Ornella Friggit, Isaac Wilder, and Marcin Jekot for pairing with me extensively, and Julia Evans, Manuel Odendahl, and Vaibhav Sagar for draft input.