Well, it's an option. I could write some simple programs in Pascal and see what assembly code is being generated.
But I think there should be a formal document for this step. I mean, when somebody implements a compiler, he/she must follow several steps: lexing, parsing, generation of the symbol table, and object code generation. Well, I am searching for a scheme with the correspondences between the structures being parsed and the code generated.
E.g.:
program: program id other body; --> initialize memory
body: begin instructions end ';' --> generate instructions 'enter', 'leave'
...
I'm not interested in the sources of GPC, because GPC is designed as a front-end, and I'm searching for the back-end...
It is still not clear what you really want. I will assume that you want to understand how a compiler works (and maybe write a simple one). Gcc is an optimizing compiler working in many passes, and there is really NO simple scheme -- the generated code depends in a highly "non-additive" way on the source; in particular, in the last stages the generated code is rearranged to allow more instructions to execute in parallel, and some sequences of instructions are replaced by better ones. As far as I know, you can disable most of the optimizations (using the -O0 flag), but some optimizations are still performed. In fact, turning optimizations on/off changes which procedures are used in some stages to generate code, so the translation scheme really depends on the exact switches you gave to gcc.

In theory one should precisely describe the expected effect of the various transformations and only then begin to code the compiler. In practice tiny little details matter most, and once you have spelled out exactly every little detail, you realize that you really have a computer program. It is wasteful to code the same computation twice (and there is little hope that two different programs will perform the same computation anyway), so the only formal description of what gcc is doing is the source code.

If you want an informal overview of how gcc works, then besides the gcc documentation you may look at: http://cobolforgcc.sourceforge.net/cobol_14.html where you can find probably the best (however incomplete) description of the interface between the front end and the back end. If you want to know the exact rules used to produce i386 instructions from gcc-internal data, you may look at gcc/config/i386/i386.md (but that file is cryptic; I would not dare to modify it).
I think that one general remark is in place: the compiler performs the translation in many phases; even if each phase is very simple (not the case for gcc), the final effect may appear complex -- in other words, if you try to describe the process as a single step, the description becomes very complex.
If you want a very simplified description of the whole process, here it goes:
first stage -- build data structures in the compiler: collect type, variable and procedure declarations, and store procedure bodies as trees representing sequences of instructions, loops, conditionals, procedure calls and assignments. The main program is treated as the body of a fictional procedure.
the first stage is almost independent of the target.
second stage -- allocate variables: in Pascal you basically compute how much space the variables will take
in the second stage you have to know how big the basic types are, and possibly the alignment rules -- on i386 you may use no alignment, but you get better performance (and compatibility with other compilers) if you allocate 2-byte variables at even addresses and 4-byte variables at addresses divisible by 4 (on the newest processors you should also align 8-byte variables)
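As a minimal sketch of this allocation stage (in Python; the function names and the "alignment equals size" rule are my own illustration, matching the i386 guidelines above, not any real compiler's code):

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment

def allocate(variables):
    """variables: list of (name, size_in_bytes) pairs.
    Returns per-variable frame offsets and the total frame size,
    aligning each variable to its own size (2-byte on even offsets,
    4-byte on multiples of 4, as described above)."""
    offsets = {}
    offset = 0
    for name, size in variables:
        offset = align_up(offset, size)
        offsets[name] = offset
        offset += size
    return offsets, offset

# A 1-byte char, then a 4-byte integer, then a 2-byte integer:
print(allocate([("c", 1), ("i", 4), ("s", 2)]))
# the char sits at 0, the integer is bumped to offset 4, the
# 2-byte variable lands at 8, for a 10-byte frame
```

Note how the padding between "c" and "i" is wasted space: real compilers often sort variables by alignment to avoid such holes.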
now we can generate code: the bulk of the code is expressions; they are trees with simple (binary or unary) operators or function calls in the nodes. We may assume that the simple operators work on integers and that each operation corresponds to a single machine instruction -- other operators are replaced by function calls. One big part of the work when translating expressions is register allocation. In a simple scheme you allocate a temporary variable for each tree node and just fetch values from memory before each operation and store the result after it. So x := x + y; becomes

    movl x,%eax
    movl y,%ebx
    addl %ebx,%eax
    movl %eax,x

if x and y are global variables. For local variables you use their offset inside the stack frame, like: movl -8(%ebp),%eax. For boolean expressions like x := y > z; the code looks like:

    movl y,%eax
    movl z,%ebx
    cmpl %ebx,%eax
    jg l1
    movl $0,%eax
    jmp l2
l1: movl $1,%eax
l2: movl %eax,x
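The tree-walking part of this scheme can be sketched in a few lines of Python. This is my own illustration (not gcc's code), and it uses a slight variation of the scheme above: instead of naming a temporary for each node, intermediate results are spilled to the hardware stack with pushl/popl:

```python
# Naive code generator: fetch operands before each operation,
# leave every result in %eax, spill intermediates to the stack.

def gen_expr(node, out):
    """node is ('var', name), ('const', n) or (op, left, right)."""
    kind = node[0]
    if kind == 'var':
        out.append(f"\tmovl {node[1]},%eax")
    elif kind == 'const':
        out.append(f"\tmovl ${node[1]},%eax")
    else:
        op = {'+': 'addl', '-': 'subl'}[kind]
        gen_expr(node[2], out)        # right operand into %eax
        out.append("\tpushl %eax")    # spill it
        gen_expr(node[1], out)        # left operand into %eax
        out.append("\tpopl %ebx")     # right operand back into %ebx
        out.append(f"\t{op} %ebx,%eax")

def gen_assign(dest, expr):
    out = []
    gen_expr(expr, out)
    out.append(f"\tmovl %eax,{dest}")
    return out

# x := x + y;
print("\n".join(gen_assign("x", ('+', ('var', 'x'), ('var', 'y')))))
```

The output matches the shape of the hand-written example above, just with the second operand going through the stack rather than straight into %ebx.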
for a conditional instruction if cond then I1 else I2; the code looks like:

    Computation of cond
    movl cond, %eax   # cond is a temporary holding the value of the condition
    cmp $0, %eax
    jz l1
    Translation of I1
    jmp l2
l1: Translation of I2
l2:

The capitalized phrases above mean that you need to expand the corresponding fragment.
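The same if/then/else scheme as a Python sketch (my own illustration; the already-translated fragments are passed in as lists of instruction strings, and the label numbering is my own convention):

```python
# Each argument is an already-translated fragment (a list of lines);
# cond_code must leave the condition value in %eax.

label_counter = 0

def new_label():
    global label_counter
    label_counter += 1
    return f"l{label_counter}"

def gen_if(cond_code, then_code, else_code):
    l_else = new_label()
    l_end = new_label()
    return (cond_code
            + ["\tcmp $0,%eax", f"\tjz {l_else}"]
            + then_code
            + [f"\tjmp {l_end}", f"{l_else}:"]
            + else_code
            + [f"{l_end}:"])

print("\n".join(gen_if(["\tmovl cond,%eax"],
                       ["\tmovl $1,%eax"],   # then-branch
                       ["\tmovl $2,%eax"]))) # else-branch
```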
procedure call: foo(expr1, expr2, expr3) -- all parameters by value:

    Compute expr3
    push expr3
    Compute expr2
    push expr2
    Compute expr1
    push expr1
    call foo
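A sketch of this call sequence in Python (my own illustration; each argument fragment is assumed to leave its value in %eax, and I have added the cdecl-style caller cleanup of the argument bytes, which the scheme above leaves implicit):

```python
def gen_call(name, arg_codes):
    """arg_codes: one already-translated fragment per argument,
    in source order; each fragment leaves its value in %eax."""
    out = []
    for code in reversed(arg_codes):   # push last argument first
        out.extend(code)
        out.append("\tpushl %eax")
    out.append(f"\tcall {name}")
    if arg_codes:                      # caller removes the arguments
        out.append(f"\taddl ${4 * len(arg_codes)},%esp")
    return out

# foo(a, b):
print("\n".join(gen_call("foo", [["\tmovl a,%eax"], ["\tmovl b,%eax"]])))
```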
procedure body:

    Prolog
    Expand body instructions
    Epilog
The Prolog and Epilog depend on the exact calling convention; with the scheme above only %ebx and %esp need saving, so a simple enter and leave are enough, but gcc wants more registers preserved, so one has to push them on the stack in the Prolog and pop them in the Epilog.
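A sketch of such a prolog/epilog pair (my own illustration in Python, assuming the i386 convention where %ebx, %esi and %edi are callee-saved; a real compiler would only save the registers it actually clobbers):

```python
CALLEE_SAVED = ["%ebx", "%esi", "%edi"]   # assumed callee-saved set

def gen_prolog(frame_size):
    out = ["\tpushl %ebp", "\tmovl %esp,%ebp"]
    if frame_size:
        out.append(f"\tsubl ${frame_size},%esp")   # room for locals
    out += [f"\tpushl {r}" for r in CALLEE_SAVED]  # preserve registers
    return out

def gen_epilog():
    out = [f"\tpopl {r}" for r in reversed(CALLEE_SAVED)]
    out += ["\tleave", "\tret"]   # leave = movl %ebp,%esp; popl %ebp
    return out

print("\n".join(gen_prolog(16) + ["\t... body ..."] + gen_epilog()))
```

The pops must mirror the pushes in reverse order, and leave undoes both the local allocation and the saved %ebp in one instruction.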
Pointer dereference: x := y^;

    movl y,%eax
    movl (%eax),%eax
    movl %eax,x
Arrays and other complex data are effectively represented by pointers (addresses) -- given the address of the whole object, the compiler computes the address of a component -- you may reduce the whole of Pascal to simple fragments like the above (well, they are similar, but formally you need more such fragments) and address computations.
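For a concrete instance of such an address computation, here is the standard formula for a one-dimensional Pascal array a : array[lo..hi] of T (my own worked example, not taken from any particular compiler):

```python
def element_address(base, lo, index, elem_size):
    """Address of a[index] for a : array[lo..hi] of T,
    where the array data starts at `base` and sizeof(T) == elem_size."""
    return base + (index - lo) * elem_size

# array[5..10] of 4-byte integers starting at address 1000:
assert element_address(1000, 5, 5, 4) == 1000   # first element
assert element_address(1000, 5, 7, 4) == 1008   # two elements further
```

On i386 the multiply-and-add often folds into a single addressing mode such as movl (%ebx,%ecx,4),%eax once index - lo is in a register.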
If you are trying to understand compilers, you should probably look at something simpler than the whole of Pascal -- you may find on the net examples of compilers for some subsets of Pascal. The scheme I presented above can handle full Pascal, but given in full detail it would be long, boring and (I believe) not easier to understand due to the amount of detail -- and the resulting compiler would generate really lousy code (IMHO gcc gives you 10-20 times faster code).