How evaluate.c works

evaluate() consists of two main parts - a scanner (also known as a tokenizer) and a token evaluator. There is also the variable table, but that's simply a data structure that maps names to values.
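As a rough sketch of the kind of variable table meant here - the names, sizes and value type below are invented for illustration, not taken from evaluate.c - a simple name-to-value mapping might look like:

```c
#include <string.h>

#define MAXVARS 64

/* One table entry: a name mapped to a value. */
struct var {
    char   name[32];
    double value;
};

static struct var vartab[MAXVARS];
static int nvars;

/* Look a name up, creating it (with value 0) if it doesn't exist yet. */
static struct var *var_lookup(const char *name)
{
    int i;
    for (i = 0; i < nvars; i++)
        if (strcmp(vartab[i].name, name) == 0)
            return &vartab[i];
    strncpy(vartab[nvars].name, name, sizeof vartab[nvars].name - 1);
    vartab[nvars].value = 0;
    return &vartab[nvars++];
}
```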

The scanner turns a stream of characters (the expression string) into a stream of tokens.

  1. The first thing it does is allocate enough memory for the worst case, in which every character is an individual token. It then treats this as a 'list' of tokens, with a 'list pointer' pointing at the first unused token.
  2. We then loop over every character individually. The loop pointer can 'skip forward' for tokens that have more than one character.
  3. To translate characters into tokens very quickly, we have a 'scantable', where every possible character is given a token value. This token value is written into the 'current token'. That alone isn't enough to tokenize, so we then make a few decisions based on what the token was.
    • For most of the operators, all necessary scanning was done with just that one non-alphanumeric character, so all they have to do is advance the list pointer to the next token. A few other operators might have more characters, so they check the next one and advance the stream if necessary.
    • Assignment is special. As we always have to put a variable name before it, all we do is look back one token and, if it is a variable, change its token from TK_VAR to TK_ASSN. Of course, if there isn't a variable before it, then there's definitely a syntax error in the string.
    • For numbers, we have to do quite a bit of work. The number scanner is in another function so we can re-use it later, outside the scanner. The basic premise is that while the next character is a valid digit, we can multiply by 10 (or 16 for hex numbers) and add the value of the character we just read. For decimal numbers, if we stop at a decimal point, we can turn the number into a real value and carry on reading digits after the decimal point.
    • For strings, written in the scantable as TK_VAR, first we read on until we know what the whole string is. If we had operators that were actually strings instead of symbols, like Perl's "and", "or" and "not", we would make those comparisons here. We then try to see if it's a function by comparing it to all the names in the function table. If so, we turn it into a TK_FUNC token. Otherwise, we can assume it's a variable, and thus mark the start and end of the name for use later on.
    • We also have a few special tokens that aren't 'real', that is they're part of the scanning process and not part of the evaluation.
    • TK_SKIP says that this character isn't part of any token, but it isn't wrong to see it in the expression. It allows us to have spaces in our expressions without getting into trouble for them.
    • TK_ERROR says that this character isn't part of any token, and it is wrong to see it in the expression.
    • TK_BREAK says that the expression scanning is finished and OK, even though we haven't reached the end of the character stream yet. The main loop that does tokenization and evaluation knows about this and will call us again and again until we eventually do reach the end.
  4. After a successful scan, we 'lace up' the tokens with pointers, so we can easily insert other tokens without copying them about. We also null-terminate any variable names - we couldn't do this earlier as it would destroy the character that defined the next token.
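The 'multiply and add' number scan in step 3 can be sketched like this - the function name and interface are invented, and the real scanner in evaluate.c also handles hex, which is omitted here:

```c
#include <ctype.h>

/* Reads a number starting at *s, advancing *s past it.  Returns the
   value; *isreal is set if a decimal point was seen. */
static double scan_number(const char **s, int *isreal)
{
    double val = 0.0;
    *isreal = 0;
    while (isdigit((unsigned char)**s))         /* integer part */
        val = val * 10.0 + (*(*s)++ - '0');
    if (**s == '.') {                           /* stopped at a decimal */
        double place = 0.1;                     /* point: carry on with  */
        (*s)++;                                 /* the fraction          */
        *isreal = 1;
        while (isdigit((unsigned char)**s)) {
            val += (*(*s)++ - '0') * place;
            place /= 10.0;
        }
    }
    return val;
}
```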

Now that we've done the token scan, we must work on the tokens until they are fit for evaluation.

  • The first thing we do is wrap the expression in a set of parentheses, because this avoids special cases in our algorithms for the start and end. In particular, a closing parenthesis forces a 'flush' of pending operations, so our evaluation is guaranteed to have no operations left stacked at the end.
  • Next, note that all the minus signs in the string were scanned as TK_SUB, which is a binary operator, so we have to convert the ones that are really unary to TK_NEG. In my opinion, those are the ones that follow an open parenthesis or an operator. Put the other way round: a TK_SUB becomes a TK_NEG provided it isn't preceded by a close parenthesis, a value or a variable.
  • Implicit multiplication is where you can write "5a" and mean "5*a"; however, it's my opinion that you really mean "(5*a)", so I've given implicit multiplication a higher precedence than everything else. My rule for implicit multiplication is that it goes between any two tokens where the left one is a variable, value or close bracket, and the right one is a variable, value, open bracket or function. The only exception is where the left and right tokens are the same kind - it doesn't follow that "1 2 3" should evaluate to 6; it should be a syntax error.
  • Variables have to exist and be pointed at by tokens, so we look up any assignments in the token stream to ensure those variables are created if necessary, and then look at every variable reference to ensure that all variables already exist or are pulled from the environment. Admittedly, this doesn't catch errors like "x=x+1" where x didn't exist before this evaluation, but this is an evaluator, not a programming language. In such cases, x has a value of 0.
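The unary-minus rule above can be sketched as a pass over the token list - the token names follow the text, but the flat-array representation here is invented for illustration:

```c
/* A TK_SUB becomes a TK_NEG unless the token before it is a close
   bracket, a value or a variable. */
enum tok { TK_VAL, TK_VAR, TK_SUB, TK_NEG, TK_OPEN, TK_CLOSE };

static void fix_unary_minus(enum tok *t, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (t[i] != TK_SUB)
            continue;
        if (i == 0 ||
            (t[i-1] != TK_CLOSE && t[i-1] != TK_VAL && t[i-1] != TK_VAR))
            t[i] = TK_NEG;      /* follows '(' or an operator: unary */
    }
}
```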

Finally, we're ready for evaluation. We make two stacks - one for holding values and one for holding operators. Why? Well, we'll be using what's known as postfix evaluation, which is unambiguous. This requires that we stack values as we see them, then apply the operators in the correct order. Each operator pulls as many values as necessary from the stack, then pushes the result back. Eventually, there's only one value left on the stack - the result. However, the token stream is in infix notation - what can we do? Convert one to the other. The idea of an infix-to-postfix converter is that the innermost operations get done first, then the outermost ones. This is done by running left to right through the token stream and pushing operators onto a stack. However, for binary operators, all higher-precedence operators already on the stack must be output to some destination stream before the new operator can go on. Similarly, a close bracket forces a 'flush' of the stack back to the nearest open bracket on it. The algorithm goes like so:

empty postfix list
FOR all tokens in infix list
  SELECT current token
  CASE variable, value
     append current token to postfix list
  ENDCASE

  CASE unary operator, open bracket
    push token onto stack
  ENDCASE

  CASE binary operator, close bracket
    WHILE precedence(token on top of stack) > precedence(current token)
      pull token from stack, append it to postfix list
    ENDWHILE

    IF current token = close bracket
      pull token from stack, cause error if it's not an open bracket
    ELSE
      push current token onto stack
    ENDIF
  ENDCASE
ENDFOR

However, we've been a bit clever: instead of appending operators to a list, we execute them right away. And that's basically it, apart from a few decisions here and there about converting arguments between ints and reals.
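A stripped-down version of that scheme - execute operators as they are popped, rather than appending them to a postfix list - might look like this in C. It's an illustration, not the evaluate.c code: it only handles single-digit numbers, '+', '-', '*' and brackets, and it pops on equal precedence as well, so that left-associative operators like '-' group correctly:

```c
static int prec(char op)
{
    switch (op) {
    case '+': case '-': return 1;
    case '*':           return 2;
    }
    return 0;                       /* brackets: lowest */
}

/* Pull two values, apply op, push the result back. */
static void apply(double *val, int *v, char op)
{
    double b = val[(*v)--], a = val[(*v)--];
    switch (op) {
    case '+': val[++*v] = a + b; break;
    case '-': val[++*v] = a - b; break;
    case '*': val[++*v] = a * b; break;
    }
}

static double eval(const char *s)
{
    double val[64];
    char ops[64];
    int v = -1, o = -1;
    ops[++o] = '(';                 /* wrap the expression in brackets */
    for (;; s++) {
        char c = *s ? *s : ')';     /* end of string closes the wrap */
        if (c >= '0' && c <= '9') {
            val[++v] = c - '0';     /* values go straight to the stack */
        } else if (c == '(') {
            ops[++o] = c;
        } else {                    /* binary operator or close bracket */
            while (o >= 0 && ops[o] != '(' && prec(ops[o]) >= prec(c))
                apply(val, &v, ops[o--]);
            if (c == ')')
                o--;                /* discard the matching '(' */
            else
                ops[++o] = c;
        }
        if (!*s)
            break;
    }
    return val[v];                  /* one value left: the result */
}
```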

  • TK_ADD, TK_SUB and TK_MUL will pull two arguments from the stack. If both are integers, integer operations will be used; otherwise real operations will be used, and if there is one real and one integer, the integer will be converted to a real. The result has the type of whichever operation was used.
  • TK_EQ, TK_NE, TK_LT, TK_LE, TK_GT and TK_GE will also convert to reals unless two integers are present, however the result is always an integer - either 1 for true or 0 for false.
  • TK_DIV, TK_POW and TK_FUNC always convert to reals and return reals.
  • TK_MOD, TK_AND, TK_OR, TK_BNOT, TK_BAND, TK_BOR, TK_BXOR, TK_SHL and TK_SHR always convert to integers and return integers.
  • TK_NEG and TK_ASSN carry the type of their one argument.
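The promotion rule can be sketched for the TK_ADD case - the value representation below is invented for illustration, not taken from evaluate.c:

```c
struct value {
    int    isreal;
    long   i;       /* valid when !isreal */
    double r;       /* valid when isreal  */
};

/* int + int stays an int; anything else promotes to real. */
static struct value add_values(struct value a, struct value b)
{
    struct value out;
    if (!a.isreal && !b.isreal) {
        out.isreal = 0;
        out.i = a.i + b.i;
    } else {
        out.isreal = 1;
        out.r = (a.isreal ? a.r : (double)a.i)
              + (b.isreal ? b.r : (double)b.i);
    }
    return out;
}
```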

How do the 'stacks' work?

I don't use an abstract datatype to operate the stacks; I use basic C operations. I have an array of the appropriate type with enough entries to hold anything I want, and a 'stack pointer' that points at the current topmost item on the stack. It is initialised to -1, to say that the stack is empty. The following C fragments show the canonical stack operations:

valstk = the stack
vstk = the stack pointer

if (vstk < 0)           : if the stack is empty
if (vstk < n)           : if the stack has n items or fewer on it
valstk[vstk]            : get the item on top of the stack
valstk[++vstk] = item   : push an item onto the stack
valstk[vstk--]          : pull an item from the stack
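Wrapped into functions, the same fragments become (the array size is arbitrary):

```c
static double valstk[64];       /* the stack */
static int vstk = -1;           /* stack pointer: -1 means empty */

static void push(double item) { valstk[++vstk] = item; }
static double pull(void)      { return valstk[vstk--]; }
static int is_empty(void)     { return vstk < 0; }
```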