Basic Parsing Concepts
From Computer Tyme Support Wiki
- GULP - Main GULP Index
Contents |
Routines - Data Structures - Concepts
First - the tools. These are the pieces that need to be created in order to make this work. I will outline the concepts here. These shouldn't be hard to write. Once I define some concepts and routines we'll put it together and give you the big picture.
- NextWord - NextWord is the main text parsing routine. It take the next logical word from the imput stream for processing. If the input stream is X=Y+(Z*5)/10 the NextWord tokens will break down as:
X = Y + ( Z * 5 ) / 10
How do you code NextWord? Probably some giant regular expression. It all depends on how complex you want to make the interpreter. But no mater what you decide to use for a syntax, the idea is that NextWord tokenizes the input stream separating out the individual commands.
- Nugget - After you parse the next work in the input stream you then need to look it up to see what it means. Different tokens maen different things. Is it a constant or a variable or a command. First you test it to see if it's a number. If so - then treat it as such. Is it a string constand? You tell that by looking for "" or whatever else you might use to say "this is a string". Otherwise - it's a command.
Obviously you are going to have to have a table of commands and a way to search for them so when you have a token then you can look it up to see what it is supposed to do. I've has good results using hash tables to find things fast. The hash table stores a pointer that points to the runtime code if you are using an interpreter model, or a token that is compiuled to represent that command so that at run time the token is an offset into and array of pointers that pint to the runtime code. So findToken take a command - looks it up and it returns something thatr represents what the command is supposed to do.
Nugget does a lot of work if you want to be efficient. When get gets constants it has to match them up with some code to process the constant. So it will get a pointer to the executable code along with the constant so that the two can be embedded together into the running program. For example 4 means NumericConstant 4 - maybe even ByteConstant 4 as opposed to 4000 which is WordConstant 4000, or 4.5 which is FloatConstant 4.5. Then you have the string constants which would compile StringConstant "text".
Forth Type runtime Interpreter
This will be ugly code but this is the next step in the process. Lets suppose we are going to write a Forth like language. The input stream is:
4 5 + print
So what we compile for the runtime engine is:
NumericConstant 4 NumericConstant 5 + print
The runtime for operators looks like this:
+ - Pop, Pop + Push Result
In this case + is assumed to be smart so that if it pops strings it concatinates them.
The Instruction Pointer starts at the begining. NumericConstant is a pointer to code and that code grabs the next byte in the input stream and pushes it on the stack. The Instruction Pointer then advances to the next NumericConstant that does the same as the first pushing 5 onto the stack. + points to the runtime code for + that pops the 5 and 4 off the stack - adds them - and pushed 9 on the stack. Then write pops off the 9 and prints it.
And there you have it. It's ugly - but it's an interpreter. And it's gast too. But ugly isn't out goal here. but it is a first step. You can go ahead and code this up and see if you can get it to work because that will be your run time engine.
Compile Time vs. Run Time Definitions
In the previous example the run time engine and the interpreter were the same. Each comand is read and executed in sequence and this the language is computer friendly and not human friendly.
An important concept here is that we are now going to separate the run time definition from the compile time definition so that every command has two meanings. One meaning at run time - and a different meaning at compile time. At compile tome the interpreter doesn't execute the program. It merely finds the executable tokens and arranges them in a memory buffer so that later the run time interpreter runs off the tokens in the memory buffer. This allows the compile time interpreter to read human friendly code and turn it into machine friendly code.
So at compile time the command "+" doesn't mean to add two numbers. It mean find the run time token for "+" and put it into the memory buffer for later execution. The number "2" doesn't mean to put a 2 on the stack. It means to put the run time token for ByteConstant into the memory buffer with a binary 2 after it. At ru time the BtyeConstant token fetches the next byte following itself and pushes it on the stack.
At compile time you are merely setting up the memory buffer for run time execution. Once the memory buffer is set up you start at the beginning and execute the tokens till you get to the end. Of course in complex languages there will be things like jumps and calls and such so it isn't going to be necessarilly sequential, but we'll deal with that later. For now - lets see if we can get it to work with one line expressions.
Making it Pretty
So - how do we turn it around? The word + doesn't mean anything unless we have two numbers to add. The runtime is very efficient that way because 4 5 + is what the computer wants to see. But we want to write "4 + 5".
Remember at compile time the word "+" means to prepare the + tokens - not to actually add anything. In the previous example we had a simple engine. It grabbed the NextWord, found the executable, and ran it. We are now separating that into two stages.
During the compiling phase we look up the NextWord, find the executable token that references the executable code, and we put that in the memory buffer for later execution. An important concept is that we don't have to build tokens in the same order as we see them in the imput stream.
So at compile time the word "+" can be redifined. Rather that just put the + token in the input stream it can be defined to interpret the following word and then put the + token in the input stream. Let's define a new term here called Nugget".
- Nugget - grab the NextWord and process it.
So '+' means - do Nugget - then compile +. That way the string 4 5 + reads 4 + 5. The number "4" compiles ByteConstant 4. Then "+" is interpreted and it says, "Do the next one and then get back to me. So "+" makes "5" compile which build ByteConstant 5 - then "+" puts the runtime token for + in the memory buffer. Thus binary operators like + become human readable.
Now - lets take the word "Write". We were using it at the end to grab the top of the stack and display it. 9 write prints the number 9 on the screen. But let's give "write" a different meaning at compile time so that write interprets the entire line and prints everything left on the stack to the screen. So the syntax is:
Write 4 + 5
"Write" is defined at compile time to keep doing nugget until you get to the end of the line - then it puts the write token into the memory buffer. So "write" is now at the front of the expression making the code pretty. Our compile time definitions look like this.
InterpretLine - Keep doing Nugget till the line is finished. Write - InterpretLine write + - Nugget + * - Nugget * Number - NumericConstant number String - StringConstand string
So you can see the input stream can now look like:
Write 4 * 5 + 3
And it compiles:
ByteConstant 4 ByteConstant 5 * ByteConstant 3 + Write
In the above I've used NumericConstant and ByteConstant as if they mean the same thing. Depending in how simple or complex you want to get they might. For a simple interpreter you might process all variables as strings and change them to numbers at run time. If simplicity is your goal and runtime efficiency and speed are not an issue, that would be a good choice.
However - if you are writing a language where run time execution needs to be faster then when you interpret a number you might evaluate the number at compile time and build a runtime token based on the type of number you are processing. Numbers from 0 to 255 can be compiled into a single byte using ByteConstant. You would also have WordConstant, IntegetConstant, ShortIntConstant, FloatConstant so that the runtime is optimized for the type of number you are passing. Same thing with Strings. You might use strings with a byte length in front of them or a null delimiter.
We are not actually coding the interpreter here. This is about the general concepts behind GULP. You get to choose how you want to make it work.
Using Parens
Every language has some sort of grouping operators like ( ) or { }. I prefer ( ) so I'll use them here. Parens are used to say that certain things go together. It not only controls the execution order - but also groups many tokens into one single logical token to be used as a parameter to something. For example.
Write 5 + ( 6 * 7 )
In this case the "(" will be defined to "keep doing Nugget until you find a matching ")". The parens actually compile no code at all. They just affect the order that code is compiled in. In this case "Write" says to the rest of the line and then compile write. The "5" compiles itself as a constant. The "+" says do nugget first. The "(" is interpreted and it says keep doing Nugget until you find a matching ")". The "6" compiles itself as a constant. The "*" says do a Nugget first and come back to me. The "7" is compiled as a constant. The "*" finishes and compiles the "*" token. The ")" finished the "(" allowing "+" to compile it's token. And at the end of the line the "write" token is compiled. Thus you end up with:
ByteConstant 5 BtyeConstant 6 ByteConstant 7 * + Write
Let's try another example using strings this time. We'll define these string functions:
SubString (String,Position,Count) : String LeftString (String,Count) : String StringPos (SearchString,String) : Number
But at runtime these functions are talking to the data stack in a Forth like mannet.
SubString - Pop Count, Pop Position, Pop String, Push Result LeftString - Pop Count, Pop String, Push Result StringPos - Pop String, Pop SearchString, Push Result Length - Pop String, Push Result
So - how do we process this? It simply works. Thse functions are interpreted at compile time to process the next nugget and then compile its runtime token. And the parens make sure that all the parameters are processed before the runtime token is compiled.
SubString - Nugget, SubString LeftString - Nugget, LeftString StringPos - Nugget, StringPos Length - Nugget, Length
So - let's look at some code. We want to search a string, find the key, and write the next 3 characters after the key. for now we'll ignore that we haven't talked about variables yet and how to assign values to them. But because the goal here is that this is intuitive, we'll just skip a few steps.
St := "abc123xyz" Key := "123" Write SubString(St,StringPos(Key,St)+Length(Key)+1,3)
As before - Write says do the rest of the line first. Substring says do the next token first. Left Paren says keep doing nuggets till you find a matching right paren. St is a string variable so it compiles the ReadVariable token followed by a reference to where that variable can be found. At run time the variable is fetched and pushed onto the stack. Then StringPos invokes Nugget which causes leftparen to run. As with the other paren it keeps doing nuggets till it finds the matching right paren. Then Key compiles ReadVariable and its location reference and St compiles again as a variable. The right paren completes the left paren and StringPos is compiled. The "+" executes the next Nugget first. It in turn invokes Length which executes the next nugget. Left paren keeps executing nugget till it fing the matching right paren. Key compiles ReadVariable and it's reference. The right paren finishes and allows the length token to be compiled. then "+" does nugget causinh ByteConstant 1 to be compiled and the + is compiled. Then ByteConstand 3 is compiled and the right paren lets SubString finally compile. And the end of the line allow write to compile. Whew!
What does this look like at runtime?
ReadVariable St ReadVariable Key ReadVariable St StringPos ReadVariable Key Length + ByteConstant 1 + ByteConstant 3 SubString Write
As you can see - we've taken nice looking human readable code and turned it into nasty Forth like computer friendly RPN code with an extremely simple recursive interpreter. As you can see the parens are used here to create a logical grouping of parameters for funstions. So all functions can be defined as Nugget and then compile the functions token. The function is expected to have onle one parameter - but the onle parameter can be a paren group. Parens aren't necessary if the function is followed by a single parameter.
Length St Length (St)
These compile exactly the same thing.
ReadVariable St Length
In Summary
The main points to understand in this section is that we are taking nice human readable code and compiling it into nasty tokenized Forth like computer friendly code. And we are doing it through a extremely simple recursive Interpreter where a function either adds its run time token immediately, interprets the next Nugget and then compiles it's token, or interprets the rest of the line and then compiles its token. It is simple, fast, and codes easy, and runs fast. The beauty is to keep it simple and elegant.