Introduction to gmIL

The name gmIL stands for Great Migrations Intermediate Language. The gmIL is a binary reverse-polish representation. Source languages are compiled into gmIL. That representation is then analyzed and modified to be expressible in a target language.Then the gmIL is authored in that target language. A subset of gmIL can also be executed via an engine within gmBasic.

The primary words of gmIL are called Opcodes. Each Opcode defines a basic operation or value reference performed within the intermediate language. There are a total of 206 opcodes. Each opcode has a type and a role which combine to define the actual type of the binary emission used by the operation and the semantic role that operation performs. Most opcodes also have subcodes associated with them. These subcodes can be a simple member of a fixed enumeration, or they can be a value, or an address into the symbol storage area. The opcode and subcode identifiers are then organized into a reverse-polish language like the pseudo-code produced by the first pass of contemporary compilers. The gmIL is this language. It is needed to organize every phase of the translation process.

What is a Translator?

In the November/December 1987 Micro/Systems Journal, A.G.W. Cameron reviewed a FORTRAN to C translator. In that review, he took the position that the only difference between a translator and a compiler should be that the compiler translates the source code into assembly code while the translator takes it to a higher level language. The technology described here takes precisely this approach to translation. First, it compiles the source code into a low level pseudo-code, much like the pseudo-code produced by the first pass of contemporary compilers. This is the gmIL. Second, it analyses that code again using the same techniques as would be used by a conventional compiler. Third, it does code generation; but the code generated is not machine code, it is a target language.

The techniques used to produce the target language are based on the work done in the area of recreation of source code. In particular, P.J. Brown -- "More on the Recreation of Source Code from Reverse Polish." Software Practice and Experience, Vol. 7, p545-551 (1977) -- discusses the recreation of source code from reverse polish notation. This discussion reviews that work and shows how it can be easily extended via a "surface-form description language" and via "external library descriptions" to allow not only for the surface form variation needed to accommodate different target languages, but also to give the final user of the translator the ability to manipulate the translation to his own needs.

If the references upon which this approach is based seem old, they are. The original coding of this approach was done in Fortran for a 1972 Linguistics Masters thesis "SPELL, Systematic Procedures for Equating Logic and Language." For SPELL, the term "Logic" referred to a formal notation used to express meaning and "Language" referred to English. It then quickly became FTSQR, Formula Translation Squared, A Fortran program that converted programs written in nonstandard Fortran into machine independent Fortran. Then, in the early 80's, it branched into PROMULA, Processor of Multiple Languages, and PFC, Portable Fortran Compiler. These are both implemented in C on desktop computers and use C as the target language. The primary difference between PROMULA and PFC is that PROMULA can both directly execute and translate its source code; while PFC translates and then compiles the C along with a runtime library to do the execution. Both PFC and PROMULA have been actively in use since the 80's.

Description of Approach

The approach taken by the translator is based on the standard theories of compiler design. Waite and Goos -- Waite, W. M. and G. Goos Compiler Construction. Springer-Verlag, New York, 1984, p. 4in [1] define a compilation as "a sequence of transformations (SL,L1), (L1,L2),...,(Lk,TL) where SL is the source language and TL is the target language." The languages L1 .. Lk are referred to as "intermediate" languages.

The problem with the standard theories of compiler design is that though there is ample discussion of how to do the conversion from the source language to the intermediate language, there is little discussion of how to move from an intermediate language back to a higher level target language. For compilers, the target language is always a low-level machine code. Fortunately, there has been some work done in the area of the re-creation of source code from intermediate code. Many BASIC interpreters need to be able to execute BASIC statements efficiently; while still giving the user access to the source code. The re-creationists solve this problem by compiling the source statements into a reverse polish notation and then rewriting the source code from that internal notation.

The technique of source code recreation from reverse polish notation hinges on the observation that executing such notation via a stack-oriented pseudo-machine is identical to writing and combining the character sequences that perform those operations in some target language via a stack-oriented string manipulation machine. The above can best be understood via a simple example.

Consider the following statement which we wish to both execute via a p-machine and re-create


   A = B * C + D.

The intermediate form of this statement might be as follows


   PUSHADR       A
   PUSHADR       B
   GETVAL
   PUSHADR       C
   GETVAL
   MULT
   PUSHADR       D
   GETVAL
   ADD
   PUTVAL
   EOS

where for the execution pseudo-machine


   OpCode  Meaning
   ------  -------
   PUSHADR means push the address of the variable onto the stack.
   GETVAL  means pop the address from the stack, obtain the value at that
           address, and push it onto the stack.
   MULT    means pop the top two values from the stack, multiply them
           together, and push the result onto the stack.
   ADD     means pop the top two values from the stack, add them together,
           and push the result onto the stack.
   PUTVAL  means pop the value and address from the stack and then store the
           value at the address.
   EOS     means end the current statement.

and for the string-manipulation machine, using ANSI-C notation


   OpCode  Meaning
   ------  -------
   PUSHADR means enter the notation for a pointer to the variable onto the
           stack.
   GETVAL  means pop the top string from the stack, convert it to the notation
           for a value at the indicated address, and push that string back
           onto the stack.
   MULT    means pop the top two strings from the  stack, concatenate them
           with the symbol '*' in the middle, and push that string back onto
           the stack.
   ADD     means pop the top two strings from the stack, concatenate them
           with the symbol '+' in the middle, and push that string back onto
           the stack.
   PUTVAL  means pop the top two strings from the stack, convert the lower
           one to a value string, concatenate the two strings with the symbol
           '=' in the middle, and push the resultant string onto the stack.
   EOS     means write the current string.

Execute these two machines and watch the symmetry between them. Assume that A is stored at address 45, that B is at address 50 and has value 2.0, that C at 55 has value 3.0, and D at 60 has value 4.0. It is both exciting and interesting to note that these address and value assumptions are needed for the execution pseudo-machine only. Alternately, the string manipulation machine needs access to the symbol table and a target language specification, which the execution-machine does not.


   Instruction  Execution stack  String-manipulation stack
   -----------  ---------------  -------------------------
   PUSHADR A    45               "&A"
   PUSHADR B    45, 50           "&A", "&B"
   GETVAL       45, 2.0          "&A", "B"
   PUSHADR C    45, 2.0, 55      "&A", "B", "&C"
   GETVAL       45, 2.0, 3.0     "&A", "B", "C"
   MULT         45, 6.0          "&A", "B*C"
   PUSHADR D    45, 6.0, 60      "&A", "B*C", "&D"
   GETVAL       45, 6.0, 4.0     "&A", "B*C", "D"
   ADD          45, 10.0         "&A", "B*C+D"
   PUTVAL                        "A=B*C+D"
   EOS

Though this simple example only shows how to translate "A=B*C+D" back into itself, its extension to approximately 200 pseudo-operations, and its generalization via a "surface-form" notation, allow for a very fast and sophisticated translator. With a single extension for dealing with operator hierarchies, to be discussed below, the above is all that the translator does. It produces pseudo-code and a symbol table from the source code, simplifies the code by using an execution machine, and then writes the code back out in user-definable target form via a string manipulation machine.

A serious problem faced by the re-creationists, is that the production of the output string is completely independent of the original input made by the user. Though the original and the result mean the same thing, they might look very different. Fortunately, in this application this is not a problem. The user wants semantic and not syntactic identity.

The other important point is that the string machine need not operate against the reverse polish notation as originally formed by the compiler. The string machine can operate on any well-formed representation. This fact is key for the deep-refactoring operations which manipulate the low level notation directly.

The Problem with Parentheses

A problem which was ignored in the above example has to do with parentheses or operator precedence. The reason reverse polish notation is typically used as an intermediate language and even in the design of contemporary machine languages, is that it uses no parentheses. Consider the statement


A = B * (C + D)

in which the addition operation is to be performed prior to the multiplication. Using the same notation as above the intermediate form of this statement would be as follows.


  PUSHADR   A
  PUSHADR   B
  GETVAL
  PUSHADR   C
  GETVAL
  PUSHADR   D
  GETVAL
  ADD
  MULT
  PUTVAL
  EOS

Note that the effect of the parentheses in intermediate form was simply to reorder the operations.

Now, using the same execution and string-manipulation rules as before, the following would result.


 Instruction  Execution stack   String-manipulation stack
 -----------  ---------------   --------------------------
 PUSHADR A    45                "&A"
 PUSHADR B    45, 50            "&A", "&B"
 GETVAL       45, 2.0           "&A", "B"
 PUSHADR C    45, 2.0, 55       "&A", "B", "&C"
 GETVAL       45, 2.0, 3.0      "&A", "B", "C"
 PUSHADR D    45, 2.0, 3.0, 60  "&A", "B", "C", "&D"
 GETVAL       45, 2.0, 3.0, 4.0 "&A", "B", "C", "D"
 ADD          45, 2.0, 7.0      "&A", "B", "C+D"
 MULT         45, 14.0          "&A", "B*C+D"
 PUTVAL                         "A=B*C+D"
 EOS

Obviously the execution result of 14.0 is correct, but the string-manipulation result of "A=B*C+D" , which is identical to the previous result, is wrong. The parenthesis have been omitted. The re-creationists solve this problem by entering special markers into the intermediate notation to indicate where the parentheses occurred. This technique obviously does not work in the translation application, because the placement of the parentheses in the target language is a function of the operator precedence of the target language and not of the source language.

The solution taken is quite simple and general. Each rule in the string manipulation machine has associated with it a precedence code. Whenever the output of an operation is written to the string-stack, the precedence code of the rule that produced the output is also written. Whenever an element is combined within a rule, if its precedence code is nonzero and lower than the code of the rule, then that element is enclosed in parentheses as it is concatenated into the new output.

Applying this extension to the example above, the precedence codes for the various operations are as follows:

Operation	Precedence
PUSHADR	0
GETVAL	0
MULT	2
ADD	1
PUTVAL	0

Note that operations not involved in evaluations have a precedence of zero. MULT has a higher precedence then ADD because multiplication is higher in precedence than addition. Using the precedence rule from above the string-manipulation machine can be run again on the intermediate form of A=B*(C+D). The rule and stack precedence codes are shown in parentheses.


 Instruction    String-manipulation stack
 -----------    -------------------------
 (0) PUSHADR   A    (0) "&A"
 (0) PUSHADR   B    (0) "&A", (0) "&B"
 (0) GETVAL         (0) "&A", (0) "B"
 (0) PUSHADR   C    (0) "&A", (0) "B", (0) "&C"
 (0) GETVAL         (0) "&A", (0) "B", (0) "C"
 (0) PUSHADR   D    (0) "&A", (0) "B", (0) "C", (0) "&D"
 (0) GETVAL         (0) "&A", (0) "B", (0) "C", (0) "D"
 (1) ADD            (0) "&A", (0) "B", (1) "C+D"
 (2) MULT           (0) "&A", (2) "B*(C+D)"
 (0) PUTVAL         (0) "A=B*(C+D)"
     EOS

The critical point in the above occurs when the output of the ADD operation, which has a precedence code of 1, is combined via the MULT operation which has a precedence of 2. Since 1 is nonzero and less than 2, the string "C+D" is enclosed in parentheses as it is entered into the output string of the MULT instruction.

The Translation Process

To understand the "surface-form" description language, an overview of the entire translation process is first needed. This process has three requirements:

The source language must be translatable into an intermediate language all of whose operators are expressed in reverse polish notation and be they unary, binary, ternary, etc. produce either a single result or no result.
In the target language every n-ary operator can be expressed as an arbitrary but fixed concatenation of its operands and of other fixed character sequences.
It must be possible to gain access to all operations performed in the source language via the target language.

The first two requirements are easy to achieve via almost any contemporary language. The third is the killer in so far as the translation of VB6/ASP/COM code into the .NET languages in particular, and really in general. The decision to change the conversion of a large application to a new language is not simply because the new notation is preferred to the older one. It is made because the new language offers access to newer needed facilities. But it is never the case that those new facilities are backward compatible with the ones assumed by the old language. It is the need to adopt the older code not just in the new language but also to the new facilities that create the need for this approach.

The Surface-Form Description Notation

The design surface-form description language is based on the first two requirements discussed above. For each operation code in the intermediate language, there is an entry in the surface-form description language. Each operation description can have components which vary by the language being produced. Each entry contains at least the following:

A specification of the number of operands associated with the operation -- i.e., whether the operation is null, or unary, or binary, etc. -- for the particular language. Remember that all operations are reverse polish; therefore, when a given operation is encountered, its operand strings have already been placed on the string-machine stack. The operands are numbered starting with the oldest first. In other words, the operand deepest on the stack is argument 1 and the operand at the top of the stack is argument n, for an n-ary operator.
A specification of the precedence of the operator relative to others. As the output production proceeds, it is necessary to enclose certain operations in parentheses to achieve the proper order of evaluation. The current precedence of each operand is maintained. When two operators of lower precedence are combined via an operator of higher precedence, then they are enclosed in parentheses.
A pattern string which specifies an arbitrary but fixed concatenation of the operands and of other character sequences which can be described via a linear pattern string. This pattern string is, of course, the actual implementation of our translation process requirement number (2) as presented earlier. The pattern string describes not only how the operands are combined but also how the various constants, symbol table entries, and miscellaneous special-purpose conversion routines combine to form the final output.

The bulk of the discussion below deals with pattern strings. They are deceptively simple.

The Example Revisited

Earlier a set of six operation codes was presented along with a verbal description of what each did when executed via the string-machine. This subsection presents the identical information using the surface-form notation in order to introduce the concept. A specification might be as follows.


<pattern id="PUSHADR" >
   <all narg="0" level="0" code="\v"/>
</pattern>
<pattern id="GETVAL" >
   <all narg="1" level="0" code="%1i"/>
</pattern>
<pattern id="MULT" >
   <all narg="2" level="2" code="%1d * %2d"/>
</pattern>
<pattern id="ADD" >
   <all narg="2" level="1" code="%1d + %2d"/>
</pattern>
<pattern id="PUTVAL" >
   <all narg="2" level="1" code="%1i = %2d"/>
</pattern>
<pattern id="EOS" >
   <all narg="1" level="0" code="%1d\c"/>
</pattern>

First the surface-forms or patterns must specify which target language they relate to. The language identifier all specifies that the pattern relates to all targets. As specified above, the attribute narg specifies the number of arguments and the attribute level specifies the precedence level. The attribute code string is the output pattern string. The basic operation of the pattern strings is straightforward. The output processor moves to the next operation in the intermediate language. It then looks up the pattern information for that operation. It removes as many operand strings from the string stack as are specified in the narg attribute of the specification. It saves these operand strings and the current precedence level associated with each in a temporary buffer. Next, a new string is formed using the actual code string as a guide. Finally, the result string is pushed onto the stack and assigned the hierarchy specified in the level attribute.

Within the pattern strings there are three types of specifications:

Special operation parameters which consist of a backslash followed by a letter. These parameters trigger special conversions. Thus, in the above rule for PUSHADR the \v specifies that a pointer to a variable whose symbol number is specified following the opcode is to be entered into the output string at the indicated location. The \c notation in the EOS string specifies that the current string stack is to be cleared and thus written to the output file.
Operand conversion parameters which consist of a percent sign, followed by a numeric digit, followed by a conversion code. The numeric digit specifies which operand is to be entered at this point in the string, and the conversion code specifies any special operation to be performed. In the above, the %1d which occurs in the MULT, ADD, and EOS strings says "Enter the first operand without any additional editing (other that any precedence editing which may be needed) into the output string." The %2d says the same thing for operand 2. The %1i specification in GETVAL and PUTVAL says "Enter the string in instantiated form -- i.e., change it from a pointer representation to the representation of the value pointed to."
Simple character specifications are any characters not forming one of the two specifications above. Simple characters are entered into result strings exactly as entered.

This is basically all there is to know about pattern strings. There are obviously more specification codes; however, their description here is not important. As can be appreciated, the above notation is extremely powerful.

Thec Opcode Functional Grouping

The opcodes themselves are organized into the following functional grouping:

Function	Description
Aithmetic	The arithmetic operators perform calculations, which may be actual arithmetic but might well be any other type of calculation like masking or concatenation or comparison. They typically have suboperations for the different binary types that may be involved.
Generic	The generic operations are the operations that are performed by any language as opposed to being specific to the particular source languages being processed here.
Patterns	The pattern support operations are used to facilitate the transformations of the opcodes produced by to compiler into opcodes that produce the correct translations when they are authored.
gmSL	The gmSL support operations are used by the gmSL compiler and runtime support to execute the methods compiled using that language. It also contains a number or dummy opcodes that can be used by gmSL programmers to facilitate migration of codes to specific targets
Support	The VB6/ASP support operations are used by the source language compiler and to characterize the VB6/ASP particular features.
Enumerations	The VB6 enumerations represent the various enumerations and their entries.
Classes	The VB6 classes represent the classes built into VB6 and their individual components.

Each of these groups is described in detail in this manual.

The Opcode Types

Each opcode has a type that controls the meaning and physical size of the subcodes that it controls. The opcode types are as follows:

Type	Description
Arithmetic	The arithmetic operators perform calculations, which may be actual arithmetic but might well be any other type of calculation like masking or concatenation or comparison. They typically have suboperations for the different binary types that may be involved.
SingleByte	The single byte operators have simple generic roles within the intermediate language like introducing a new code sequence or marking the end of some sequence.
DoubleByte	The double byte operators have generic roles within the intermediate language. Their emission consists of the operation emission code followed by the suboperation emission code. Within other statements they are referenced as OPC.Subcode.
ShortValue	The short value operator emission is followed by a two-byte integer value used to specify a simple integer constant or number.
SymbolAddr	The symbol address operators are followed by a four-byte integer value that specifies the root offset in the current storage area of a symbol defined within the source code.
StringAddr	The string address operators are followed by a four-byte integer value the specifies the offset in current storage of a character string. This string might be an actual string or it might be a value constant of some sort.
OpcodeAddr	The opcode address operators are followed by a four-byte integer value that marks the offset of the start of an intermediate language code sequence.
BinaryType	The binary type operators, not to be confused with Arithmetic binary operators, have a suboperation code that defines a binary type. They are used extensively in the intermediate code to specify the binary result-type of a sequence of operations.
LibraryAdr	The library address operators are followed by a four-byte integer value that specifies the root in either current storage or language storage of an external symbol in use by the source code. Positive roots are to current storage; while negative roots are to language storage.
LibPattern	The library pattern operators are followed by the offset of a surface pattern description either in the current storage area or in language storage.
ObjectType	The object type operators have a suboperation code that specifies the type of object that the some sequence of operations pertain to.

The Opcode Roles

Each opcode has a role that describes the general role that its operations perform in the intermediate language. The roles adds detail to the type information. They are as follows:

Role	Description
NewLev	There are many nested sequences of operators. This role starts a new nesting level.
EndLev	There are many nested sequences of operators. This role ends a nesting level.
LoadSym	This role loads a source code symbol.
LoadLib	This role loads a library symbol.
LoadInt	This role loads a short integer value.
LoadStr	This role loads a character string.
LoadLng	This role loads a string representation of a long integer constant.
LoadDbl	This role loads a string representation of a real constant.
LoadLog	This role loads a binary representation of a logical (boolean) constant.
LoadDat	This role loads a string representation of a date constant.
LoadSpv	This role loads a subcode that defines some source language particular special value like Me.
LoadTyp	This role loads an enumeration entry value.
Unary	This role defines a unary operator.
Binary	This role defines a non-relational binary operator.
Relate	This role defines a relational binary operator.
Member	This role asserts the membership of a code sequence in the preceding code sequence.
Clsref	This role asserts that the subcode refers to a class component
Convert	This role performs a type-conversion of some sort.
Comment	This role introduces a non-executable comment.

Table of Contents

gmStudio Documentation

gmILIntroduction