Math Parser, Part 3: Implementation

November 08, 2008 / category: Parsing / 8 comments

The example implementation of a math parser is going to be written in Ruby because of its simplicity. Remember, however, that parsers are usually written in low-level languages, like C or C++. They are often located in the backbone of much bigger programs and are called frequently, so their performance has a great influence on overall system performance. Note that C code is fast, but its main disadvantage is the lack of portability (C code has to be compiled for every software platform separately).

Now we will go back to Ruby. As we have said in the previous part, it is a good idea to split parser into two parts: a lexer and a parser. We'll start from creating the lexer as it is the basic part.

My eBook: “Memoirs of a Software Team Leader”
Read more »


Lexer's aim is to transform the input into a sequence of tokens. It will be convenient to create two classes: Lexer and Token. Lexer contains the get_next_token() method, which will be called by the parser. get_next_token() reads some characters from the input, recognizes them as a token (or it doesn't) and returns an instance of the Token class. Token class looks like following:

class Token
  Plus     = 0
  Minus    = 1
  Multiply = 2
  Divide   = 3

  Number   = 4

  LParen   = 5
  RParen   = 6

  End      = 7

  attr_accessor :kind
  attr_accessor :value

  def initialize
    @kind = nil
    @value = nil
  end

  def unknown?
    @kind.nil?
  end
end

At the beginning of the class, there are available token types defined. They are usually put inside an enum-like variable, but unfortunately Ruby doesn't provide any enum type. kind attribute holds the information about token type (Plus, Divide etc.). One may be wondering what is value attribute defined for. Well, if the token is a number, value attribute holds the number's value. Otherwise, value will just hold the nil value. Perhaps creating NumberToken class extending Token would be a more elegant and flexible solution, but we won't do it now as we want to focus on the main problem.

Token class is ready, so we can move on to the Lexer class:

class Lexer
  def initialize(input)
    @input = input
    @return_previous_token = false
  end

  def get_next_token
    if @return_previous_token
      @return_previous_token = false
      return @previous_token
    end

    token = Token.new

    @input.lstrip!

    case @input
      when /\A\+/ then
        token.kind = Token::Plus
      when /\A-/ then
        token.kind = Token::Minus
      when /\A\*/ then
        token.kind = Token::Multiply
      when /\A\// then
        token.kind = Token::Divide
      when /\A\d+(\.\d+)?/
        token.kind = Token::Number
        token.value = $&.to_f
      when /\A\(/
        token.kind = Token::LParen
      when /\A\)/
        token.kind = Token::RParen
      when ''
        token.kind = Token::End
    end

    raise 'Unknown token' if token.unknown?
    @input = $'

    @previous_token = token
    token
  end

  def revert
    @return_previous_token = true
  end
end

At creation time, the Lexer instance is being bound to the input. get_next_token() method was already mentioned before. It reads characters from the input and tries to recognize them as one of the token types. If it fails, it raises a RuntimeError exception.

Given case statement takes advantage of one of the Ruby's nice features. In a when clause, object can be compared not only to a simple constant value; regular expressions and ranges can be used as well. Ale the object has to do is to respond to the === method.

One more thing that requires an explanation is the revert() method. It is likely to happen that when we are reading tokens by calling the get_next_token() method, we will go too far. In such a situation, we would like to have the ability to "unget" the last token. Otherwise, we would have to store the token somewhere and refer to it when the time comes. It would be inconvenient from the Lexer's client point of view and that is why the revert() method has come to be.

The time has come to show the Parser class. It's a big bunch of code. Don't worry though, we are going to explain it method by method.

require 'lexer'

class Parser
  def parse(input)
    @lexer = Lexer.new(input)

    expression_value = expression

    token = @lexer.get_next_token
    if token.kind == Token::End
      expression_value
    else
      raise 'End expected'
    end
  end

  protected
  def expression
    component1 = factor

    additive_operators = [Token::Plus, Token::Minus]

    token = @lexer.get_next_token
    while additive_operators.include?(token.kind)
      component2 = factor

      if token.kind == Token::Plus
        component1 += component2
      else
        component1 -= component2
      end

      token = @lexer.get_next_token
    end
    @lexer.revert

    component1
  end

  def factor
    factor1 = number

    multiplicative_operators = [Token::Multiply, Token::Divide]

    token = @lexer.get_next_token
    while multiplicative_operators.include?(token.kind)
      factor2 = number

      if token.kind == Token::Multiply
        factor1 *= factor2
      else
        factor1 /= factor2
      end

      token = @lexer.get_next_token
    end
    @lexer.revert

    factor1
  end

  def number
    token = @lexer.get_next_token

    if token.kind == Token::LParen
      value = expression

      expected_rparen = @lexer.get_next_token
      raise 'Unbalanced parenthesis' unless expected_rparen.kind == Token::RParen
    elsif token.kind == Token::Number
      value = token.value
    else
      raise 'Not a number'
    end

    value
  end
end

To parse the expression, client should invoke the parse() method. Its only parameter is the input string. It creates a Lexer instance and tries to evaluate given math expression.

The expression(), factor() and number() methods come almost directly from the grammar we set before. Parentheses handling has been moved from the factor() to the number() method, so the former is not overloaded with code and, as a result, it's easier to read.

The expression() method expects one or more components, separated with additive operators (+ or -). Each component can consist of factors, so every component is a result of the factor() method invocation. Parser goes inwards first, splits the expression into basic parts and then gather the partial results together. That's it, our parser is recursive. One can also call it a "top-down" parser.

The factor() method is similar to the expression() method. The number() method is more interesting. It expects either a number or a whole expression enclosed in parentheses. If it encounters a number, the matter is simple. The show begins when there is a left parenthesis; a valid expression and a right parenthesis is then expected. Again, if Parser detects a syntax error, it raises a RuntimeError exception.

Example usage of the Parser class looks like this:

require 'parser'

parser = Parser.new

loop do
  begin
    print '>> '
    puts parser.parse(gets)
  rescue RuntimeError
    puts 'Error occured: ' + $!
  end
end

User can interact with the parser by using command line interface. The script displays either expression value or an error info. Of course, the Parser class can be used in another environment, e.g. in a web application.

We can consider the parser finished, but it's always a good idea to verify if the piece of software we are creating is correct. Let's talk about testing for a while.

Comments

There are 8 comments / Submit your comment

Nick
July 11, 2009 03:03 AM

Thanks! This is an awesome tutorial. I'm sure it will become well linked.

testking ccie
October 30, 2010 04:59 AM

When I first started thinking about writing code to calculate a mathematical expression, I soon realized that it wasn’t that easy [at least for me]. I had to delve into such concepts as parsing, tokens, lexical analysis, node trees and tables, etc.

After searching the literature I came across an outline of an algorithm for parsing arithmetic expressions in Principles of Systems Programming by Robert M. Graham (Wiley, 1975). It used the four binary operators of multiplication, division, addition, and subtraction, set precedence values for operator tokens, and provided for handling parentheses. Basically, the rules of the algorithm were the same ones we learned in algebra. For example: multiply (or divide) before adding or subtracting; start with inner parentheses and work outwards; if a pair of binary operators have equal precedence, perform the left operator of the pair first.

pete
April 07, 2013 10:52 PM

Thanks so much. This is really helpful!

Ankit
April 24, 2013 02:54 PM

Thanks a lot for this series of tutorials. Helped me a lot to get around lexical analysis. Right now, I am trying to implement variables with this. For example,

A = B = 4

E = (F = 2) * 2

Stuff like that. Are there any pointers you'd suggest to keep in mind whilst trying my hand at these?

Lukasz Wrobel
April 24, 2013 05:44 PM

@Ankit:

When you add variables, the parsing process doesn't get much more complicated. You'll only need to extend expression syntax and store variables in some kind of hashtable.

The only actual problem I see is about types. If there's only one type allowed (like in the article), then all correct expressions are correct in the sense of type-checking, too. But if you like to allow more than just one type, the situation will get much more complicated. Type inference system is too complex to be hidden in the parser; it may be a good idea to separate it, just like it's good to separate lexer and parser.

Ankit
April 24, 2013 06:08 PM

Hmm....I see. So, I need to make a hash table of the user inputs, and tokenize it accordingly, and then perform the operation on the variables, right?

I'm sorry. I am fairly new to this and super stubborn when I get stuck.

And your tutorial has been a huge help. I'd like to thank you again. This should be spread around the interwebz. :)

Lukasz Wrobel
April 25, 2013 05:54 PM

@Ankit:

So, I need to make a hash table of the user inputs, and tokenize it accordingly, and then perform the operation on the variables, right?

It doesn't necessarily have anything to do with user input. I would rather say that you need to know value held by a variable every time it is used. The first usage of a variable should be an assignment, then you can store the value in a hashtable. And if someone asks you for a variable which is not present in your hashtable, you should raise an exception (unless you initialize variables with zero by default, which I wouldn't recommend).

Andy
October 16, 2013 10:03 PM

Lovely article.

For your information, I have also implemented C++/Java expression parsers to convert expressions into reverse Polish notation and evaulate the result. Some free downloads can be found at this link.:

http://www.technical-recipes.com/2011/a-mathematical-expression-parser-in-java-and-cpp/

Comments and suggestions welcome/

You can use Markdown in your comments if you wish. Examples:

*emphasis*
emphasis
**strong**
strong
`inline code`
inline code
[My blog](http://lukaszwrobel.pl)
My blog
# use 4 spaces to indent
# a block of code
    def my_method(x)
      x = x + 1
    end
def my_method(x)
  x = x + 1
end

* First.
* Second.
  • First.
  • Second.

> This is a citation.
> Even more citation.

I don't agree with you.

This is a citation. Even more citation.

I don't agree with you.


Submit your comment

(required)

(optional)

(required, Markdown supported)


Preview:

My eBook: “Memoirs of a Software Team Leader”

Read more »