Math Parser, Part 1: Introduction

Everyone knows how to calculate value of a simple math expression, like "2 + 3 * 7". But not every programmer knows how to write a program that would accomplish the same task.

The first idea usually looks like this:

result = get_number()
while operator = get_operator()
  operand = get_number()
  result = result operator operand
end

return result

It means: “read the expression from left to right and perform appropriate calculations when operator occurred”. Let’s try this approach out on our previous example, "2 + 3 * 7":

get_number: 2

get_operator: +
get_number: 3
result: 5

get_operator: *
get_number: 7
result: 35

So we end up with 35 as the expected expression’s value. Obviously, we are mistaken. The reason is simple: our algorithm does not preserve operator precedence. It would be acceptable for long additive operations chains like "5 + 6 - 7 + 2 - 1", but algorithm will fail on even slightly more complex expressions. It seems like we need a completely different approach.

Reverse Polish Notation

The notation that we used to know from school is called an “infix notation”, since the operator is written between its arguments ("5 + 2"). But there are also other notations, like prefix notation and postfix notation. They are called “Polish notation” and Reverse Polish notation, respectively. The first one was invented by Polish logician Jan Łukasiewicz, while the latter is complementary to the first one. The Reverse Polish notation (postfix notation) makes calculating expression value a whole lot easier. Let’s look at an example of expression written using Reverse Polish notation:

5 2 +

As you can see, the operator is written after its arguments. Reverse Polish notation has two big advantages: it doesn’t require to set operators’ priority and it allows to compute the value of the expression using a simple algorithm. All we have to know is how many arguments does the specific operator require. Moreover, any expression written in Reverse Polish notation is unambiguous, even without using parenthesis.

So, our first example, "2 + 3 * 7", looks like this when written in Reverse Polish notation:

2 3 7 * +

The algorithm that calculates value of the expression written in Reverse Polish notation is usually implemented using a stack. It consists of few simple steps:

while input_left
end

while symbol = get_symbol()
  if symbol.is_number?
    stack.push(symbol)
  elsif symbol.is_operator?
    arguments = pop_as_many_numbers_from_stack_as_the_operator_requires
    result = argument1 symbol argument2...
    stack.push(result)
  end
end

return stack.pop()

Let’s follow these steps using our example, "2 3 7 * +":

read 2:
  push 2 on stack
read 3:
  push 3 on stack
read 7:
  push 7 on stack
read *:
  pop two numbers from stack
  got 3 and 7, calculate 3 * 7 = 21
  push 21 on stack
read +:
  pop two numbers from stack
  got 2 and 21, calculate 2 + 21 = 23
  push 23 on stack

return 23

Algorithm is quite simple, and it is also quite easy to implement. By the way, Reverse Polish notation calculator is a common task on a programming course at the beginning of computer science study (even if it is called a “Postfix calculator”, it’s still the same task). If you are looking for a ready-to-use implementation, read the Reverse Polish notation parser article.

Everything seems to be fine, but could you imagine yourself telling a spreadsheet’s user: “Use postfix notation instead. It saves me a lot of work.”?. Our deliberations have been interesting from theoretical point of view, but we need some practical solution for computing infix expressions’ values. Of course, we could use an algorithm to convert expression from infix to postfix notation, but that’s not what we were looking for. Why? Firstly, conversion itself is complicated. Secondly, what if we would like to evaluate something different than math expressions, e.g. procedure written in the programming language that we just invented?

We need a more general solution: the recursive parser. Stay tuned!

Proceed to part 2: Grammar.