Basic Parsing Concepts

From Computer Tyme Support Wiki

Revision as of 07:30, 2 October 2005 by Marc (Talk | contribs)
Jump to: navigation, search
  • GULP - Main GULP Index

Routines - Data Structures - Concepts

First - the tools. These are the pieces that need to be created in order to make this work. I will outline the concepts here. These shouldn't be hard to write. Once I define some concepts and routines we'll put it together and give you the big picture.

  • NextWord - NextWord is the main text parsing routine. It take the next logical word from the imput stream for processing. If the input stream is X=Y+(Z*5)/10 the NextWord tokens will break down as:
X = Y + ( Z * 5 ) / 10

How do you code NextWord? Probably some giant regular expression. It all depends on how complex you want to make the interpreter. But no mater what you decide to use for a syntax, the idea is that NextWord tokenizes the input stream separating out the individual commands.

  • FindToken(Token) - After you parse the next work in the inpot stream you then need to look it up to see what it means. Different tokens maen different things. Is it a constant or a variable or a command. First you test it to see if it's a number. If so - then treat it as such. Is it a string constand? You tell that by looking for "" or whatever else you might use to say "this is a string". Otherwise - it's a command.

Obviously you are going to have to have a table of commands and a way to search for them so when you have a token then you can look it up to see what it is supposed to do. I've has good resultes using hash tables to find things fast. The hash table stores a pointer that points to the runtime code if you are using an interpreter model, or a token that is compiuled to represent that command so that at run time the token is an offset into and array of pointers that pint to the runtime code. So findToken take a command - looks it up and it returns something thatr represents what the command is supposed to do.

FindToken does a lot of work if you want to be efficient. When get gets constants it has to match them up with some code to process the constant. So it will get a pointer to the executable code along with the constant so that the two can be embedded together into the running program. For example 4 means NumericConstant 4 - maybe even ByteConstant 4 as opposed to 4000 which is WordConstant 4000, or 4.5 which is FloatConstant 4.5. Then you have the string constants which would compile StringConstant "text".

Forth Type runtime Interpreter

This will be ugly code but this is the next step in the process. Lets suppose we are going to write a Forth like language. The input stream is:

4 5 + print

So what we compile for the runtime engine is:

NumericConstant 4 NumericConstant 5 + print

The instruction Pointer starts at the begining. NumericConstant is a pointer to code and that code grabs the next byte in the input stream and pushes it on the stack. The Instruction Pointer then advances to the next NumericConstant that does the same as the first pushing 5 onto the stack. + points to the runtime code for + that pops the 5 and 4 off the stack - adds them - and pushed 9 on the stack. Then write pops off the 9 and prints it.

And there you have it. It's ugly - but it's an interpreter. And it's gast too. But ugly isn't out goal here. but it is a first step. You can go ahead and code this up and see if you can get it to work because that will be your run time engine.

Making it Pretty

So - how do we turn it around? The word + doesn't mean anything unless we have two numbers to add. The runtime is very efficient that way because 4 5 + is what the computer wants to see. But we want to write "4 + 5". how do we turn that around?

Personal tools