The example implementation of a math parser is going to be written in Ruby because of its simplicity. Remember, however, that parsers are usually written in low-level languages, like C or C++. They are often located in the backbone of much bigger programs and are called frequently, so their performance has a great influence on overall system performance. Note that C code is fast, but its main disadvantage is the lack of portability (C code has to be compiled for every software platform separately).

Now we will go back to Ruby. As we have said in the previous part, it is a good idea to split parser into two parts: a lexer and a parser. We’ll start from creating the lexer as it is the basic part.

Lexer’s aim is to transform the input into a sequence of tokens. It will be convenient to create two classes: Lexer and Token. Lexer contains the get_next_token() method, which will be called by the parser. get_next_token() reads some characters from the input, recognizes them as a token (or it doesn’t) and returns an instance of the Token class. Token class looks like following:

class Token
  Plus     = 0
  Minus    = 1
  Multiply = 2
  Divide   = 3

  Number   = 4

  LParen   = 5
  RParen   = 6

  End      = 7

  attr_accessor :kind
  attr_accessor :value

  def initialize
    @kind = nil
    @value = nil
  end

  def unknown?
    @kind.nil?
  end
end

At the beginning of the class, there are available token types defined. They are usually put inside an enum-like variable, but unfortunately Ruby doesn’t provide any enum type. kind attribute holds the information about token type (Plus, Divide etc.). One may be wondering what is value attribute defined for. Well, if the token is a number, value attribute holds the number’s value. Otherwise, value will just hold the nil value. Perhaps creating NumberToken class extending Token would be a more elegant and flexible solution, but we won’t do it now as we want to focus on the main problem.

Token class is ready, so we can move on to the Lexer class:

class Lexer
  def initialize(input)
    @input = input
    @return_previous_token = false
  end

  def get_next_token
    if @return_previous_token
      @return_previous_token = false
      return @previous_token
    end

    token = Token.new

    @input.lstrip!

    case @input
      when /\A\+/ then
        token.kind = Token::Plus
      when /\A-/ then
        token.kind = Token::Minus
      when /\A\*/ then
        token.kind = Token::Multiply
      when /\A\// then
        token.kind = Token::Divide
      when /\A\d+(\.\d+)?/
        token.kind = Token::Number
        token.value = $&.to_f
      when /\A\(/
        token.kind = Token::LParen
      when /\A\)/
        token.kind = Token::RParen
      when ''
        token.kind = Token::End
    end

    raise 'Unknown token' if token.unknown?
    @input = $'

    @previous_token = token
    token
  end

  def revert
    @return_previous_token = true
  end
end

At creation time, the Lexer instance is being bound to the input. get_next_token() method was already mentioned before. It reads characters from the input and tries to recognize them as one of the token types. If it fails, it raises a RuntimeError exception.

Given case statement takes advantage of one of the Ruby’s nice features. In a when clause, object can be compared not only to a simple constant value; regular expressions and ranges can be used as well. Ale the object has to do is to respond to the === method.

One more thing that requires an explanation is the revert() method. It is likely to happen that when we are reading tokens by calling the get_next_token() method, we will go too far. In such a situation, we would like to have the ability to “unget” the last token. Otherwise, we would have to store the token somewhere and refer to it when the time comes. It would be inconvenient from the Lexer’s client point of view and that is why the revert() method has come to be.

The time has come to show the Parser class. It’s a big bunch of code. Don’t worry though, we are going to explain it method by method.

require 'lexer'

class Parser
  def parse(input)
    @lexer = Lexer.new(input)

    expression_value = expression

    token = @lexer.get_next_token
    if token.kind == Token::End
      expression_value
    else
      raise 'End expected'
    end
  end

  protected
  def expression
    component1 = factor

    additive_operators = [Token::Plus, Token::Minus]

    token = @lexer.get_next_token
    while additive_operators.include?(token.kind)
      component2 = factor

      if token.kind == Token::Plus
        component1 += component2
      else
        component1 -= component2
      end

      token = @lexer.get_next_token
    end
    @lexer.revert

    component1
  end

  def factor
    factor1 = number

    multiplicative_operators = [Token::Multiply, Token::Divide]

    token = @lexer.get_next_token
    while multiplicative_operators.include?(token.kind)
      factor2 = number

      if token.kind == Token::Multiply
        factor1 *= factor2
      else
        factor1 /= factor2
      end

      token = @lexer.get_next_token
    end
    @lexer.revert

    factor1
  end

  def number
    token = @lexer.get_next_token

    if token.kind == Token::LParen
      value = expression

      expected_rparen = @lexer.get_next_token
      raise 'Unbalanced parenthesis' unless expected_rparen.kind == Token::RParen
    elsif token.kind == Token::Number
      value = token.value
    else
      raise 'Not a number'
    end

    value
  end
end

To parse the expression, client should invoke the parse() method. Its only parameter is the input string. It creates a Lexer instance and tries to evaluate given math expression.

The expression(), factor() and number() methods come almost directly from the grammar we set before. Parentheses handling has been moved from the factor() to the number() method, so the former is not overloaded with code and, as a result, it’s easier to read.

The expression() method expects one or more components, separated with additive operators (+ or -). Each component can consist of factors, so every component is a result of the factor() method invocation. Parser goes inwards first, splits the expression into basic parts and then gather the partial results together. That’s it, our parser is recursive. One can also call it a “top-down” parser.

The factor() method is similar to the expression() method. The number() method is more interesting. It expects either a number or a whole expression enclosed in parentheses. If it encounters a number, the matter is simple. The show begins when there is a left parenthesis; a valid expression and a right parenthesis is then expected. Again, if Parser detects a syntax error, it raises a RuntimeError exception.

Example usage of the Parser class looks like this:

require 'parser'

parser = Parser.new

loop do
  begin
    print '>> '
    puts parser.parse(gets)
  rescue RuntimeError
    puts 'Error occurred: ' + $!
  end
end

User can interact with the parser by using command line interface. The script displays either expression value or an error info. Of course, the Parser class can be used in another environment, e.g. in a web application.

We can consider the parser finished, but it’s always a good idea to verify if the piece of software we are creating is correct. Let’s talk about testing for a while.

Go back to part 2: Grammar or proceed to part 4: Tests.

comments powered by Disqus