Basic Parsing Concepts
From Computer Tyme Support Wiki
m |
m |
||
Line 27: | Line 27: | ||
NumericConstant 4 NumericConstant 5 + print | NumericConstant 4 NumericConstant 5 + print | ||
- | The | + | The Instruction Pointer starts at the begining. NumericConstant is a pointer to code and that code grabs the next byte in the input stream and pushes it on the stack. The Instruction Pointer then advances to the next NumericConstant that does the same as the first pushing 5 onto the stack. + points to the runtime code for + that pops the 5 and 4 off the stack - adds them - and pushed 9 on the stack. Then write pops off the 9 and prints it. |
And there you have it. It's ugly - but it's an interpreter. And it's gast too. But ugly isn't out goal here. but it is a first step. You can go ahead and code this up and see if you can get it to work because that will be your run time engine. | And there you have it. It's ugly - but it's an interpreter. And it's gast too. But ugly isn't out goal here. but it is a first step. You can go ahead and code this up and see if you can get it to work because that will be your run time engine. | ||
+ | |||
+ | == Compile Time vs. Run Time Definitions == | ||
+ | |||
+ | In the previous example the run time engine and the interpreter were the same. Each comand is read and executed in sequence and this the language is computer friendly and not human friendly. | ||
+ | |||
+ | An important concept here is that we are now going to separate the run time definition from the compile time definition so that every command has two meanings. One meaning at run time - and a different meaning at compile time. At compile tome the interpreter doesn't execute the program. It merely finds the executable tokens and arranges them in a memory buffer so that later the run time interpreter runs off the tokens in the memory buffer. This allows the compile time interpreter to read human friendly code and turn it into machine friendly code. | ||
+ | |||
+ | So at compile time the command "+" doesn't mean to add two numbers. It mean find the run time token for "+" and put it into the memory buffer for later execution. The number "2" doesn't mean to put a 2 on the stack. It means to put the run time token for ByteConstant into the memory buffer with a binary 2 after it. At ru time the BtyeConstant token fetches the next byte following itself and pushes it on the stack. | ||
+ | |||
+ | At compile time you are merely setting up the memory buffer for run time execution. Once the memory buffer is set up you start at the beginning and execute the tokens till you get to the end. Of course in complex languages there will be things like jumps and calls and such so it isn't going to be necessarilly sequential, but we'll deal with that later. For now - lets see if we can get it to work with one line expressions. | ||
== Making it Pretty == | == Making it Pretty == | ||
So - how do we turn it around? The word + doesn't mean anything unless we have two numbers to add. The runtime is very efficient that way because 4 5 + is what the computer wants to see. But we want to write "4 + 5". how do we turn that around? | So - how do we turn it around? The word + doesn't mean anything unless we have two numbers to add. The runtime is very efficient that way because 4 5 + is what the computer wants to see. But we want to write "4 + 5". how do we turn that around? |
Revision as of 15:15, 2 October 2005
- GULP - Main GULP Index
Contents |
Routines - Data Structures - Concepts
First - the tools. These are the pieces that need to be created in order to make this work. I will outline the concepts here. These shouldn't be hard to write. Once I define some concepts and routines we'll put it together and give you the big picture.
- NextWord - NextWord is the main text parsing routine. It take the next logical word from the imput stream for processing. If the input stream is X=Y+(Z*5)/10 the NextWord tokens will break down as:
X = Y + ( Z * 5 ) / 10
How do you code NextWord? Probably some giant regular expression. It all depends on how complex you want to make the interpreter. But no mater what you decide to use for a syntax, the idea is that NextWord tokenizes the input stream separating out the individual commands.
- FindToken(Token) - After you parse the next work in the inpot stream you then need to look it up to see what it means. Different tokens maen different things. Is it a constant or a variable or a command. First you test it to see if it's a number. If so - then treat it as such. Is it a string constand? You tell that by looking for "" or whatever else you might use to say "this is a string". Otherwise - it's a command.
Obviously you are going to have to have a table of commands and a way to search for them so when you have a token then you can look it up to see what it is supposed to do. I've has good resultes using hash tables to find things fast. The hash table stores a pointer that points to the runtime code if you are using an interpreter model, or a token that is compiuled to represent that command so that at run time the token is an offset into and array of pointers that pint to the runtime code. So findToken take a command - looks it up and it returns something thatr represents what the command is supposed to do.
FindToken does a lot of work if you want to be efficient. When get gets constants it has to match them up with some code to process the constant. So it will get a pointer to the executable code along with the constant so that the two can be embedded together into the running program. For example 4 means NumericConstant 4 - maybe even ByteConstant 4 as opposed to 4000 which is WordConstant 4000, or 4.5 which is FloatConstant 4.5. Then you have the string constants which would compile StringConstant "text".
Forth Type runtime Interpreter
This will be ugly code but this is the next step in the process. Lets suppose we are going to write a Forth like language. The input stream is:
4 5 + print
So what we compile for the runtime engine is:
NumericConstant 4 NumericConstant 5 + print
The Instruction Pointer starts at the begining. NumericConstant is a pointer to code and that code grabs the next byte in the input stream and pushes it on the stack. The Instruction Pointer then advances to the next NumericConstant that does the same as the first pushing 5 onto the stack. + points to the runtime code for + that pops the 5 and 4 off the stack - adds them - and pushed 9 on the stack. Then write pops off the 9 and prints it.
And there you have it. It's ugly - but it's an interpreter. And it's gast too. But ugly isn't out goal here. but it is a first step. You can go ahead and code this up and see if you can get it to work because that will be your run time engine.
Compile Time vs. Run Time Definitions
In the previous example the run time engine and the interpreter were the same. Each comand is read and executed in sequence and this the language is computer friendly and not human friendly.
An important concept here is that we are now going to separate the run time definition from the compile time definition so that every command has two meanings. One meaning at run time - and a different meaning at compile time. At compile tome the interpreter doesn't execute the program. It merely finds the executable tokens and arranges them in a memory buffer so that later the run time interpreter runs off the tokens in the memory buffer. This allows the compile time interpreter to read human friendly code and turn it into machine friendly code.
So at compile time the command "+" doesn't mean to add two numbers. It mean find the run time token for "+" and put it into the memory buffer for later execution. The number "2" doesn't mean to put a 2 on the stack. It means to put the run time token for ByteConstant into the memory buffer with a binary 2 after it. At ru time the BtyeConstant token fetches the next byte following itself and pushes it on the stack.
At compile time you are merely setting up the memory buffer for run time execution. Once the memory buffer is set up you start at the beginning and execute the tokens till you get to the end. Of course in complex languages there will be things like jumps and calls and such so it isn't going to be necessarilly sequential, but we'll deal with that later. For now - lets see if we can get it to work with one line expressions.
Making it Pretty
So - how do we turn it around? The word + doesn't mean anything unless we have two numbers to add. The runtime is very efficient that way because 4 5 + is what the computer wants to see. But we want to write "4 + 5". how do we turn that around?