The example implementation of a math parser is going to be written in Ruby because of its simplicity. Remember, however, that parsers are usually written in low-level languages, like C or C++. They are often located in the backbone of much bigger programs and are called frequently, so their performance has a great influence on overall system performance. Note that C code is fast, but its main disadvantage is the lack of portability (C code has to be compiled for every software platform separately).
Now we will go back to Ruby. As we have said in the previous part, it is a good idea to split parser into two parts: a lexer and a parser. We’ll start from creating the lexer as it is the basic part.
Lexer’s aim is to transform the input into a sequence of tokens. It will be convenient to create two classes: Lexer
and Token
. Lexer contains the get_next_token()
method, which will be called by the parser. get_next_token()
reads some characters from the input, recognizes them as a token (or it doesn’t) and returns an instance of the Token
class. Token
class looks like following:
class Token
Plus = 0
Minus = 1
Multiply = 2
Divide = 3
Number = 4
LParen = 5
RParen = 6
End = 7
attr_accessor :kind
attr_accessor :value
def initialize
@kind = nil
@value = nil
end
def unknown?
@kind.nil?
end
end
At the beginning of the class, there are available token types defined. They are usually put inside an enum-like variable, but unfortunately Ruby doesn’t provide any enum type. kind
attribute holds the information about token type (Plus
, Divide
etc.). One may be wondering what is value
attribute defined for. Well, if the token is a number, value
attribute holds the number’s value. Otherwise, value
will just hold the nil
value. Perhaps creating NumberToken
class extending Token
would be a more elegant and flexible solution, but we won’t do it now as we want to focus on the main problem.
Token
class is ready, so we can move on to the Lexer
class:
class Lexer
def initialize(input)
@input = input
@return_previous_token = false
end
def get_next_token
if @return_previous_token
@return_previous_token = false
return @previous_token
end
token = Token.new
@input.lstrip!
case @input
when /\A\+/ then
token.kind = Token::Plus
when /\A-/ then
token.kind = Token::Minus
when /\A\*/ then
token.kind = Token::Multiply
when /\A\// then
token.kind = Token::Divide
when /\A\d+(\.\d+)?/
token.kind = Token::Number
token.value = $&.to_f
when /\A\(/
token.kind = Token::LParen
when /\A\)/
token.kind = Token::RParen
when ''
token.kind = Token::End
end
raise 'Unknown token' if token.unknown?
@input = $'
@previous_token = token
token
end
def revert
@return_previous_token = true
end
end
At creation time, the Lexer
instance is being bound to the input. get_next_token()
method was already mentioned before. It reads characters from the input and tries to recognize them as one of the token types. If it fails, it raises a RuntimeError
exception.
Given case
statement takes advantage of one of the Ruby’s nice features. In a when clause, object can be compared not only to a simple constant value; regular expressions and ranges can be used as well. Ale the object has to do is to respond to the ===
method.
One more thing that requires an explanation is the revert()
method. It is likely to happen that when we are reading tokens by calling the get_next_token()
method, we will go too far. In such a situation, we would like to have the ability to “unget” the last token. Otherwise, we would have to store the token somewhere and refer to it when the time comes. It would be inconvenient from the Lexer
’s client point of view and that is why the revert()
method has come to be.
The time has come to show the Parser
class. It’s a big bunch of code. Don’t worry though, we are going to explain it method by method.
require 'lexer'
class Parser
def parse(input)
@lexer = Lexer.new(input)
expression_value = expression
token = @lexer.get_next_token
if token.kind == Token::End
expression_value
else
raise 'End expected'
end
end
protected
def expression
component1 = factor
additive_operators = [Token::Plus, Token::Minus]
token = @lexer.get_next_token
while additive_operators.include?(token.kind)
component2 = factor
if token.kind == Token::Plus
component1 += component2
else
component1 -= component2
end
token = @lexer.get_next_token
end
@lexer.revert
component1
end
def factor
factor1 = number
multiplicative_operators = [Token::Multiply, Token::Divide]
token = @lexer.get_next_token
while multiplicative_operators.include?(token.kind)
factor2 = number
if token.kind == Token::Multiply
factor1 *= factor2
else
factor1 /= factor2
end
token = @lexer.get_next_token
end
@lexer.revert
factor1
end
def number
token = @lexer.get_next_token
if token.kind == Token::LParen
value = expression
expected_rparen = @lexer.get_next_token
raise 'Unbalanced parenthesis' unless expected_rparen.kind == Token::RParen
elsif token.kind == Token::Number
value = token.value
else
raise 'Not a number'
end
value
end
end
To parse the expression, client should invoke the parse()
method. Its only parameter is the input string. It creates a Lexer
instance and tries to evaluate given math expression.
The expression()
, factor()
and number()
methods come almost directly from the grammar we set before. Parentheses handling has been moved from the factor()
to the number()
method, so the former is not overloaded with code and, as a result, it’s easier to read.
The expression()
method expects one or more components, separated with additive operators (+
or -
). Each component
can consist of factors, so every component is a result of the factor()
method invocation. Parser goes inwards first, splits the expression into basic parts and then gather the partial results together. That’s it, our parser is recursive. One can also call it a “top-down” parser.
The factor()
method is similar to the expression()
method. The number()
method is more interesting. It expects either a number or a whole expression enclosed in parentheses. If it encounters a number, the matter is simple. The show begins when there is a left parenthesis; a valid expression and a right parenthesis is then expected. Again, if Parser
detects a syntax error, it raises a RuntimeError
exception.
Example usage of the Parser
class looks like this:
require 'parser'
parser = Parser.new
loop do
begin
print '>> '
puts parser.parse(gets)
rescue RuntimeError
puts 'Error occurred: ' + $!
end
end
User can interact with the parser by using command line interface. The script displays either expression value or an error info. Of course, the Parser
class can be used in another environment, e.g. in a web application.
We can consider the parser finished, but it’s always a good idea to verify if the piece of software we are creating is correct. Let’s talk about testing for a while.
Go back to part 2: Grammar or proceed to part 4: Tests.