/
gmILIntroduction
gmILIntroduction
- Mark Juras
Owned by Mark Juras
Introduction to gmIL
The name gmIL stands for Great Migrations Intermediate Language. The gmIL is a binary reverse-polish representation. Source languages are compiled into gmIL. That representation is then analyzed and modified to be expressible in a target language.Then the gmIL is authored in that target language. A subset of gmIL can also be executed via an engine within gmBasic. The primary words of gmIL are called Opcodes. Each Opcode defines a basic operation or value reference performed within the intermediate language. There are a total of 206 opcodes. Each opcode has a type and a role which combine to define the actual type of the binary emission used by the operation and the semantic role that operation performs. Most opcodes also have subcodes associated with them. These subcodes can be a simple member of a fixed enumeration, or they can be a value, or an address into the symbol storage area. The opcode and subcode identifiers are then organized into a reverse-polish language like the pseudo-code produced by the first pass of contemporary compilers. The gmIL is this language. It is needed to organize every phase of the translation process.What is a Translator?
In the November/December 1987 Micro/Systems Journal, A.G.W. Cameron reviewed a FORTRAN to C translator. In that review, he took the position that the only difference between a translator and a compiler should be that the compiler translates the source code into assembly code while the translator takes it to a higher level language. The technology described here takes precisely this approach to translation. First, it compiles the source code into a low level pseudo-code, much like the pseudo-code produced by the first pass of contemporary compilers. This is the gmIL. Second, it analyses that code again using the same techniques as would be used by a conventional compiler. Third, it does code generation; but the code generated is not machine code, it is a target language. The techniques used to produce the target language are based on the work done in the area of recreation of source code. In particular, P.J. Brown -- "More on the Recreation of Source Code from Reverse Polish." Software Practice and Experience, Vol. 7, p545-551 (1977) -- discusses the recreation of source code from reverse polish notation. This discussion reviews that work and shows how it can be easily extended via a "surface-form description language" and via "external library descriptions" to allow not only for the surface form variation needed to accommodate different target languages, but also to give the final user of the translator the ability to manipulate the translation to his own needs. If the references upon which this approach is based seem old, they are. The original coding of this approach was done in Fortran for a 1972 Linguistics Masters thesis "SPELL, Systematic Procedures for Equating Logic and Language." For SPELL, the term "Logic" referred to a formal notation used to express meaning and "Language" referred to English. It then quickly became FTSQR, Formula Translation Squared, A Fortran program that converted programs written in nonstandard Fortran into machine independent Fortran. Then, in the early 80's, it branched into PROMULA, Processor of Multiple Languages, and PFC, Portable Fortran Compiler. These are both implemented in C on desktop computers and use C as the target language. The primary difference between PROMULA and PFC is that PROMULA can both directly execute and translate its source code; while PFC translates and then compiles the C along with a runtime library to do the execution. Both PFC and PROMULA have been actively in use since the 80's.Description of Approach
The approach taken by the translator is based on the standard theories of compiler design. Waite and Goos -- Waite, W. M. and G. Goos Compiler Construction. Springer-Verlag, New York, 1984, p. 4in [1] define a compilation as "a sequence of transformations (SL,L1), (L1,L2),...,(Lk,TL) where SL is the source language and TL is the target language." The languages L1 .. Lk are referred to as "intermediate" languages. The problem with the standard theories of compiler design is that though there is ample discussion of how to do the conversion from the source language to the intermediate language, there is little discussion of how to move from an intermediate language back to a higher level target language. For compilers, the target language is always a low-level machine code. Fortunately, there has been some work done in the area of the re-creation of source code from intermediate code. Many BASIC interpreters need to be able to execute BASIC statements efficiently; while still giving the user access to the source code. The re-creationists solve this problem by compiling the source statements into a reverse polish notation and then rewriting the source code from that internal notation. The technique of source code recreation from reverse polish notation hinges on the observation that executing such notation via a stack-oriented pseudo-machine is identical to writing and combining the character sequences that perform those operations in some target language via a stack-oriented string manipulation machine. The above can best be understood via a simple example. Consider the following statement which we wish to both execute via a p-machine and re-createA = B * C + D.
PUSHADR A PUSHADR B GETVAL PUSHADR C GETVAL MULT PUSHADR D GETVAL ADD PUTVAL EOS
OpCode Meaning ------ ------- PUSHADR means push the address of the variable onto the stack. GETVAL means pop the address from the stack, obtain the value at that address, and push it onto the stack. MULT means pop the top two values from the stack, multiply them together, and push the result onto the stack. ADD means pop the top two values from the stack, add them together, and push the result onto the stack. PUTVAL means pop the value and address from the stack and then store the value at the address. EOS means end the current statement.
OpCode Meaning ------ ------- PUSHADR means enter the notation for a pointer to the variable onto the stack. GETVAL means pop the top string from the stack, convert it to the notation for a value at the indicated address, and push that string back onto the stack. MULT means pop the top two strings from the stack, concatenate them with the symbol '*' in the middle, and push that string back onto the stack. ADD means pop the top two strings from the stack, concatenate them with the symbol '+' in the middle, and push that string back onto the stack. PUTVAL means pop the top two strings from the stack, convert the lower one to a value string, concatenate the two strings with the symbol '=' in the middle, and push the resultant string onto the stack. EOS means write the current string.
Instruction Execution stack String-manipulation stack ----------- --------------- ------------------------- PUSHADR A 45 "&A" PUSHADR B 45, 50 "&A", "&B" GETVAL 45, 2.0 "&A", "B" PUSHADR C 45, 2.0, 55 "&A", "B", "&C" GETVAL 45, 2.0, 3.0 "&A", "B", "C" MULT 45, 6.0 "&A", "B*C" PUSHADR D 45, 6.0, 60 "&A", "B*C", "&D" GETVAL 45, 6.0, 4.0 "&A", "B*C", "D" ADD 45, 10.0 "&A", "B*C+D" PUTVAL "A=B*C+D" EOS
The Problem with Parentheses
A problem which was ignored in the above example has to do with parentheses or operator precedence. The reason reverse polish notation is typically used as an intermediate language and even in the design of contemporary machine languages, is that it uses no parentheses. Consider the statementA = B * (C + D)
PUSHADR A PUSHADR B GETVAL PUSHADR C GETVAL PUSHADR D GETVAL ADD MULT PUTVAL EOS
Instruction Execution stack String-manipulation stack ----------- --------------- -------------------------- PUSHADR A 45 "&A" PUSHADR B 45, 50 "&A", "&B" GETVAL 45, 2.0 "&A", "B" PUSHADR C 45, 2.0, 55 "&A", "B", "&C" GETVAL 45, 2.0, 3.0 "&A", "B", "C" PUSHADR D 45, 2.0, 3.0, 60 "&A", "B", "C", "&D" GETVAL 45, 2.0, 3.0, 4.0 "&A", "B", "C", "D" ADD 45, 2.0, 7.0 "&A", "B", "C+D" MULT 45, 14.0 "&A", "B*C+D" PUTVAL "A=B*C+D" EOS
Operation | Precedence |
PUSHADR | 0 |
GETVAL | 0 |
MULT | 2 |
ADD | 1 |
PUTVAL | 0 |
Note that operations not involved in evaluations have a precedence of zero. MULT has a higher precedence then ADD because multiplication is higher in precedence than addition. Using the precedence rule from above the string-manipulation machine can be run again on the intermediate form of A=B*(C+D). The rule and stack precedence codes are shown in parentheses.
Instruction String-manipulation stack ----------- ------------------------- (0) PUSHADR A (0) "&A" (0) PUSHADR B (0) "&A", (0) "&B" (0) GETVAL (0) "&A", (0) "B" (0) PUSHADR C (0) "&A", (0) "B", (0) "&C" (0) GETVAL (0) "&A", (0) "B", (0) "C" (0) PUSHADR D (0) "&A", (0) "B", (0) "C", (0) "&D" (0) GETVAL (0) "&A", (0) "B", (0) "C", (0) "D" (1) ADD (0) "&A", (0) "B", (1) "C+D" (2) MULT (0) "&A", (2) "B*(C+D)" (0) PUTVAL (0) "A=B*(C+D)" EOS
The Translation Process
To understand the "surface-form" description language, an overview of the entire translation process is first needed. This process has three requirements:- The source language must be translatable into an intermediate language all of whose operators are expressed in reverse polish notation and be they unary, binary, ternary, etc. produce either a single result or no result.
- In the target language every n-ary operator can be expressed as an arbitrary but fixed concatenation of its operands and of other fixed character sequences.
- It must be possible to gain access to all operations performed in the source language via the target language.
The Surface-Form Description Notation
The design surface-form description language is based on the first two requirements discussed above. For each operation code in the intermediate language, there is an entry in the surface-form description language. Each operation description can have components which vary by the language being produced. Each entry contains at least the following:- A specification of the number of operands associated with the operation -- i.e., whether the operation is null, or unary, or binary, etc. -- for the particular language. Remember that all operations are reverse polish; therefore, when a given operation is encountered, its operand strings have already been placed on the string-machine stack. The operands are numbered starting with the oldest first. In other words, the operand deepest on the stack is argument 1 and the operand at the top of the stack is argument n, for an n-ary operator.
- A specification of the precedence of the operator relative to others. As the output production proceeds, it is necessary to enclose certain operations in parentheses to achieve the proper order of evaluation. The current precedence of each operand is maintained. When two operators of lower precedence are combined via an operator of higher precedence, then they are enclosed in parentheses.
- A pattern string which specifies an arbitrary but fixed concatenation of the operands and of other character sequences which can be described via a linear pattern string. This pattern string is, of course, the actual implementation of our translation process requirement number (2) as presented earlier. The pattern string describes not only how the operands are combined but also how the various constants, symbol table entries, and miscellaneous special-purpose conversion routines combine to form the final output.
The Example Revisited
Earlier a set of six operation codes was presented along with a verbal description of what each did when executed via the string-machine. This subsection presents the identical information using the surface-form notation in order to introduce the concept. A specification might be as follows.<pattern id="PUSHADR" > <all narg="0" level="0" code="\v"/> </pattern> <pattern id="GETVAL" > <all narg="1" level="0" code="%1i"/> </pattern> <pattern id="MULT" > <all narg="2" level="2" code="%1d * %2d"/> </pattern> <pattern id="ADD" > <all narg="2" level="1" code="%1d + %2d"/> </pattern> <pattern id="PUTVAL" > <all narg="2" level="1" code="%1i = %2d"/> </pattern> <pattern id="EOS" > <all narg="1" level="0" code="%1d\c"/> </pattern>
- Special operation parameters which consist of a backslash
followed by a letter. These parameters trigger special
conversions. Thus, in the above rule for PUSHADR the \v
specifies that a pointer to a variable whose symbol number
is specified following the opcode is to be entered into
the output string at the indicated location. The \c notation
in the EOS string specifies that the current string stack
is to be cleared and thus written to the output file.
- Operand conversion parameters which consist of a percent
sign, followed by a numeric digit, followed by a conversion
code. The numeric digit specifies which operand is to be
entered at this point in the string, and the conversion
code specifies any special operation to be performed. In
the above, the %1d which occurs in the MULT, ADD, and EOS
strings says "Enter the first operand without any
additional editing (other that any precedence editing which
may be needed) into the output string." The %2d says the
same thing for operand 2. The %1i specification in GETVAL
and PUTVAL says "Enter the string in instantiated form --
i.e., change it from a pointer representation to the
representation of the value pointed to."
- Simple character specifications are any characters not forming one of the two specifications above. Simple characters are entered into result strings exactly as entered.
Thec Opcode Functional Grouping
The opcodes themselves are organized into the following functional grouping:Function | Description |
Aithmetic | The arithmetic operators perform calculations, which may be actual arithmetic but might well be any other type of calculation like masking or concatenation or comparison. They typically have suboperations for the different binary types that may be involved. |
Generic | The generic operations are the operations that are performed by any language as opposed to being specific to the particular source languages being processed here. |
Patterns | The pattern support operations are used to facilitate the transformations of the opcodes produced by to compiler into opcodes that produce the correct translations when they are authored. |
gmSL | The gmSL support operations are used by the gmSL compiler and runtime support to execute the methods compiled using that language. It also contains a number or dummy opcodes that can be used by gmSL programmers to facilitate migration of codes to specific targets |
Support | The VB6/ASP support operations are used by the source language compiler and to characterize the VB6/ASP particular features. |
Enumerations | The VB6 enumerations represent the various enumerations and their entries. |
Classes | The VB6 classes represent the classes built into VB6 and their individual components. |
Each of these groups is described in detail in this manual.
The Opcode Types
Each opcode has a type that controls the meaning and physical size of the subcodes that it controls. The opcode types are as follows:Type | Description |
Arithmetic | The arithmetic operators perform calculations, which may be actual arithmetic but might well be any other type of calculation like masking or concatenation or comparison. They typically have suboperations for the different binary types that may be involved. |
SingleByte | The single byte operators have simple generic roles within the intermediate language like introducing a new code sequence or marking the end of some sequence. |
DoubleByte | The double byte operators have generic roles within the intermediate language. Their emission consists of the operation emission code followed by the suboperation emission code. Within other statements they are referenced as OPC.Subcode. |
ShortValue | The short value operator emission is followed by a two-byte integer value used to specify a simple integer constant or number. |
SymbolAddr | The symbol address operators are followed by a four-byte integer value that specifies the root offset in the current storage area of a symbol defined within the source code. |
StringAddr | The string address operators are followed by a four-byte integer value the specifies the offset in current storage of a character string. This string might be an actual string or it might be a value constant of some sort. |
OpcodeAddr | The opcode address operators are followed by a four-byte integer value that marks the offset of the start of an intermediate language code sequence. |
BinaryType | The binary type operators, not to be confused with Arithmetic binary operators, have a suboperation code that defines a binary type. They are used extensively in the intermediate code to specify the binary result-type of a sequence of operations. |
LibraryAdr | The library address operators are followed by a four-byte integer value that specifies the root in either current storage or language storage of an external symbol in use by the source code. Positive roots are to current storage; while negative roots are to language storage. |
LibPattern | The library pattern operators are followed by the offset of a surface pattern description either in the current storage area or in language storage. |
ObjectType | The object type operators have a suboperation code that specifies the type of object that the some sequence of operations pertain to. |
The Opcode Roles
Each opcode has a role that describes the general role that its operations perform in the intermediate language. The roles adds detail to the type information. They are as follows:Role | Description |
NewLev | There are many nested sequences of operators. This role starts a new nesting level. |
EndLev | There are many nested sequences of operators. This role ends a nesting level. |
LoadSym | This role loads a source code symbol. |
LoadLib | This role loads a library symbol. |
LoadInt | This role loads a short integer value. |
LoadStr | This role loads a character string. |
LoadLng | This role loads a string representation of a long integer constant. |
LoadDbl | This role loads a string representation of a real constant. |
LoadLog | This role loads a binary representation of a logical (boolean) constant. |
LoadDat | This role loads a string representation of a date constant. |
LoadSpv | This role loads a subcode that defines some source language particular special value like Me. |
LoadTyp | This role loads an enumeration entry value. |
Unary | This role defines a unary operator. |
Binary | This role defines a non-relational binary operator. |
Relate | This role defines a relational binary operator. |
Member | This role asserts the membership of a code sequence in the preceding code sequence. |
Clsref | This role asserts that the subcode refers to a class component |
Convert | This role performs a type-conversion of some sort. |
Comment | This role introduces a non-executable comment. |
Table of Contents