In general, parser is a program that determines whether its input is valid, referring to the given grammar. So, if we would like to parse math expression, we have to set a formal grammar first. The most convenient way to do this is to write the context-free grammar’s production rules using EBNF (Extended Backus-Naur Form) notation. Take a look at this simple example:
digit = "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
The line we wrote is called a production rule and it reads: a digit can be equal to 0 or 1 or 2… or to 9. Symbol on the left side is purely abstract, and we can change it to something else by using production rule. As you might have guessed, the "|"
symbol denotes an alternative (“one of”). Values enclosed in double quotation marks are constant: they don’t appear on the left side of any production. They are often called “terminal symbols”.
We are interested in context-free grammars only, so only one symbol can appear on the left side of the production rule. "digit = ..."
or "name = ..."
are acceptable forms, while " "-" number = ..."
and "firstname lastname = ..."
are not.
It’s time for a little bit more complex example:
digit = "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
number = {digit}
The "{}"
symbol means: one or more. In our case, a number consists of at least one digit; there may be more digits as well. Let’s go one step further:
digit = "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
sign = "+" | "-"
number = [sign] {digit}
We have now extended our grammar by adding the ability to describe numbers with optional sign preceding them. That’s it, "[]"
symbol means “optional”. Now we know just enough to build a grammar describing simple math expressions:
number = {"0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"}
factor = number | "(" expression ")"
component = factor [{("*" | "/") factor}]
expression = component [{("+" | "-") component}]
Let’s lean over this grammar. It covers numbers, but also adding, subtracting, multiplying and dividing them. By placing part of expression inside parentheses, we can group this part together to change default computing precedence. A little remark: parentheses in component
and expression
symbols have been used just to group expressions, they don’t have any special meaning.
In the previous part, we were considering the simple expression: "2 + 3 * 7"
. We’ll check now how this expression can be derived from the grammar:
expression =
= component + component =
= factor + component =
= number + component =
= 2 + component =
= 2 + factor * factor =
= 2 + number * number =
= 2 + 3 * 7
We’ve shown how to use the grammar to achieve desired result. Of course, also much more complex expressions could be derived from our grammar.
The next step would be to write a grammar-driven parser. “Grammar-driven” means that parser is equipped with a set of functions and each function is equivalent to one production rule. Parser should read the input and at least check whether it is correct. It would be nice if it also could compute value of the given expression. Right now we possess knowledge to implement such a parser, but there will be one problem. Every single character would have to stand next to each other, without any whitespaces at all. So, instead of writing "2 + 3 * 7"
, spreadsheet user would have to write "2+3*7"
. It appears to be a problem, because whitespaces are widely used to provide legibility.
Fortunately, we can quickly come up with a fix. We will add a new production rule to our grammar:
space = " "
That’s just the beginning. We should add "space"
symbol to right side of every production rule wherever a whitespace can appear. But grammar would grow excessively fast and that’s too bad. Not only would it make grammar harder to understand; it would turn implementing the parser into coding horror.
That’s the last thing we should do. However, we can split our program into two subprograms: a lexer and a parser. Lexer reads the input and transforms it into a series of elements called tokens. Token is an abstract concept, e.g. "637"
is not a string consisting of three digits, it is a number. "+"
is not a string containing plus sign, it is an operator.
Lexer reads the input and splits it into tokens. All whitespaces are omitted. Parser expects to receive tokens, not just strings. NUMBER, PLUS_OPERATOR, NUMBER
is a correct sequence of tokens (like in "5 + 7"
or "1 + 20"
). NUMBER, PLUS_OPERATOR, PLUS_OPERATOR
is not correct (like in "7 + +"
).
Equipped with theoretical knowledge, we can finally start coding.
Go back to part 1: Introduction or proceed to part 3: Implementation.