Tiny Cobol Compiler Overview
----------------------------



Introduction
------------

The COBOL compiler was designed to be built with proven tools: lex/flex and
byacc (the standard yacc from Berkeley), plus a code generator written in C.
They are applied in that order to the files scan.l (scanner), htcobol.y
(parser) and htcobgen.c (code generator, listing generator and symbol table
management functions).
The output is assembler in AT&T syntax, the standard for the Linux environment.
We plan to produce an ANSI-1974 compliant version first, with extensions for
embedded SQL access and access to the tcl/tk libraries for writing visual
(GUI) applications; later we expect to evolve to a full '85 version.






Scanner (tokenizer)
-------------------

You will find a large table, reserved_symbols[], that defines all tokens
recognized by the scanner as reserved symbols.
When you add a new token to the parser, the same token must be added here.
The first entry is the token string, as it will be read by the scanner.
It must be in uppercase, because the lookup() function converts its input to
uppercase before searching the table (but does not write the result back to
the token buffer).
If a string is not found as a reserved word, its case (upper/lower) is
significant.
The second entry is the same token definition you entered in the parser
(htcobol.y). The third entry (minor) is a minor token number and may be
removed in a future release. It was needed with our limited version
of lex for the MS-DOS(TM) environment, where we shared tokens with different
minor codes. Minor codes are now history and will gradually be removed.
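
The lookup described above can be sketched as follows.  This is only an
illustration: the token codes, the three sample entries, the buffer size and
the name lookup_reserved() are assumptions made for this sketch, not the real
contents of scan.l.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; the real ones are generated from htcobol.y. */
enum { TOK_NONE = 0, TOK_ADD = 300, TOK_MOVE = 301, TOK_NOT = 302 };

struct reserved_sym {
    const char *name;   /* token string, always uppercase */
    int token;          /* token code, as defined in the parser */
    int minor;          /* minor token number (legacy) */
};

/* A tiny stand-in for the real reserved_symbols[] table. */
static const struct reserved_sym reserved_symbols[] = {
    { "ADD",  TOK_ADD,  0 },
    { "MOVE", TOK_MOVE, 0 },
    { "NOT",  TOK_NOT,  0 },
    { NULL,   TOK_NONE, 0 }          /* end marker */
};

/* Look up s without regard to case.  The conversion to uppercase is done
   on a private copy, so the caller's token buffer is left untouched.
   Returns the matching entry, or NULL if s is not a reserved word. */
const struct reserved_sym *lookup_reserved(const char *s)
{
    char upper[64];
    size_t i, n = strlen(s);
    const struct reserved_sym *p;

    if (n >= sizeof upper)
        return NULL;                 /* too long for this sketch */
    for (i = 0; i < n; i++)
        upper[i] = (char)toupper((unsigned char)s[i]);
    upper[n] = '\0';

    for (p = reserved_symbols; p->name != NULL; p++)
        if (strcmp(p->name, upper) == 0)
            return p;
    return NULL;
}
```

With this sketch, "Move" and "MOVE" both yield the MOVE entry, while a user
name such as "my-var" yields NULL and keeps its original case.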

The scanner is controlled by the parser, which makes it context-dependent, so
it is easier for the scanner to know what to look for.

	(>>>> Note: for instance, we found a hard-to-fix reduce/reduce
	conflict caused by "NOT ON OVERFLOW". It confused the parser because
	the token NOT is usually found in conditions, for instance in the IF
	statement. The conflict was removed by a dirty trick: defining a new
	state and changing NOT to another "kind of NOT", which we have called
	NOTEXCEP. It is another token, dependent on the context, but it is the
	very same string: 'N','O','T'. From the user's point of view it is
	indistinguishable from the other.)

At the very beginning of the scanner code, there is a large switch statement:

  switch (curr_division) {
      case CDIV_IDENT:
		  scdebug("-> IDENT_ST\n");
		  BEGIN IDENT_ST;
		  break;
      ...
  }
  curr_division = 0;  /* to avoid new state switching */

So each state only looks for the tokens it expects to find.  For instance,
in the COMMENT_ST state, everything that matches the regular expression

{letters}(({alphanum}|-)*{alphanum}+)?

will be matched and no token is returned to the parser (when it calls yylex()),
unless the reserved token DIVISNUM (any COBOL division identifier) is found.
This way, it consumes any input from the source program until a new division
is found.  This state is entered in the IDENTIFICATION DIVISION, when the
parser (htcobol.y) executes the following:

	identification_division:
		PROGRAM_ID "." IDSTRING EOS {
			curr_division = CDIV_COMMENT;
			pgm_header($3); }
	 ;
						
As we see, the program-id is parsed and stored first.  As it's the only thing
that really matters here, we discard everything else after the program-id,
until we find the next division token: ENVIRONMENT.

Several recognizers may be grouped under the same starting states in the
scanner by surrounding them with something like:

	<COMMENT_ST>{
	...  (several lex declarations)
	}

Instead of COMMENT_ST, we can list several states to which the same
declarations apply, as in <INITIAL,ENVIR_ST,DATA_ST,REDEF_ST>, or even
<*> to enter a global (all states) declaration. The latter must be used
with extreme care, because it affects states that have already been tested.
You probably don't want to do it.  Another special construct we use is the
<<EOF>> pattern. Please consult the flex manual or any good lex/flex book to
read more about it. In our case, it signals the end of a "copy" (include file)
operation.

One of the goals of building this scanner with several main states was to
distinguish between variables and labels (paragraphs and sections), so that
we can avoid several tokens of lookahead in the parser.  (There is a discussion
of this topic on the mailing list of the GNU-Cobol2C compiler project.)
Our compiler stores all variables during the DATA_ST state (corresponding to
the data division) and already knows what is a variable during the INITIAL
state (procedure division, default state), so it returns two distinct tokens
for a variable and for a paragraph or section identifier (VARIABLE and
LABELSTR, respectively).
This is needed to tell apart, for instance, a statement like

	PERFORM IDENT-1 OF IDENT-2 IDENT-3 TIMES

and a statement like

	PERFORM IDENT-1 OF IDENT-2 TIMES <statements> END-PERFORM

where in the first case IDENT-1 is a paragraph name and in the second IDENT-1
is a variable (field) name. This example needs 3 tokens of lookahead
(if IDENT-1 and IDENT-2 are not qualified), and in general as many as
50 tokens of lookahead!  The other possible solution is to do our
parsing with a better tool than yacc (btyacc is a backtracking yacc; see the
links section of tiny-cobol's home page). With our simple solution, IDENT-1 is
known to be a VARIABLE or a LABELSTR (if no variable was found), so the
lookahead is not needed and we can stay with regular yacc; besides, the
generated parser is much faster than one built with btyacc (when lookahead is
needed).
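
A minimal sketch of this decision is shown below.  It assumes a simple linear
table of data names; the function names, the table and the token values are
hypothetical (the real scanner consults the symbol table and uses the token
codes from htcobol.y).

```c
#include <stddef.h>
#include <string.h>

enum { VARIABLE = 1, LABELSTR = 2 };   /* hypothetical token values */

#define MAX_VARS 64
static const char *known_vars[MAX_VARS];
static size_t n_known_vars;

/* Called for every data name met while scanning the data division
   (DATA_ST).  The name is not copied, so it must outlive the table.
   Returns 0 on success, -1 when the table is full. */
int declare_variable(const char *name)
{
    if (n_known_vars >= MAX_VARS)
        return -1;
    known_vars[n_known_vars++] = name;
    return 0;
}

/* Called for a non-reserved identifier in the procedure division
   (INITIAL state): anything declared earlier is a VARIABLE, anything
   else is taken to be a paragraph or section name. */
int classify_identifier(const char *name)
{
    size_t i;

    for (i = 0; i < n_known_vars; i++)
        if (strcmp(known_vars[i], name) == 0)
            return VARIABLE;
    return LABELSTR;
}
```

Because the token already carries the answer, the grammar can have separate
PERFORM rules starting with VARIABLE and with LABELSTR, and one token of
lookahead is enough.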





Symbol table organization
-------------------------

The symbol table uses a hash to choose one of HASHLEN entries and stores each
symbol in that thread, according to a hash() function applied to the characters
of the symbol name.  For each value of "litflag" we can have:

litflag  |  actual structure | it is meant for
---------+-------------------+---------------------------------------------
  0      |  struct sym       | general symbols (files, fields, ws, linkage) 
  1      |  struct lit       | literals
  2,',', |  struct vref      | variable references to compute array indices
  '+','-'|    "      "       |    "         "           "       "     "
---------+-------------------+---------------------------------------------

All those structures must have "litflag" as the first field so we can make a
pointer conversion when accessing the actual storage.  In addition, the
structures "lit" and "sym" must share the following representation:

struct XXX {
   char litflag;     
   struct XXX *next;  
   char *name;        
   char type;
   int  decimals;
   unsigned location; 
   unsigned descriptor; 
   ...                   /* the rest of the particular structure */
};

where XXX = lit or sym.
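
The two ideas above (hashing into HASHLEN threads, and a common "litflag"
first field) can be sketched like this.  The value of HASHLEN, the hash
function itself and the helper names install_symbol(), find_symbol() and
kind_of() are assumptions made for illustration; only the shared leading
fields are taken from the layout shown above.

```c
#include <stddef.h>
#include <string.h>

#define HASHLEN 100                   /* assumed size, for illustration */

struct sym {                          /* shared leading fields only */
    char litflag;                     /* 0 = general symbol */
    struct sym *next;
    char *name;
    char type;
    int  decimals;
    unsigned location;
    unsigned descriptor;
};

struct lit {                          /* same leading layout as struct sym */
    char litflag;                     /* 1 = literal */
    struct lit *next;
    char *name;
    char type;
    int  decimals;
    unsigned location;
    unsigned descriptor;
};

static struct sym *symtab[HASHLEN];   /* one thread per hash value */

/* A simple string hash; the real hash() may be different. */
unsigned hash(const char *s)
{
    unsigned h = 0;

    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % HASHLEN;
}

/* Link sy at the head of the thread selected by its name. */
struct sym *install_symbol(struct sym *sy)
{
    unsigned h = hash(sy->name);

    sy->next = symtab[h];
    symtab[h] = sy;
    return sy;
}

/* Walk one thread looking for name; returns NULL if it is not there. */
struct sym *find_symbol(const char *name)
{
    struct sym *p;

    for (p = symtab[hash(name)]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* litflag is the first member of every structure, so it can be read
   through a plain char pointer before choosing the real type. */
const char *kind_of(const void *entry)
{
    switch (*(const char *)entry) {
    case 0:  return "sym";
    case 1:  return "lit";
    default: return "vref";           /* 2, ',', '+' or '-' */
    }
}
```

The cast in kind_of() is safe because a pointer to a structure also points
to its first member; that is why "litflag" must come first everywhere.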

How does subscripting work?  Let's look at the representation of a subscripted
variable reference.
Suppose we have the following COBOL statement: MOVE 5 TO VAR ( I + 1, J - 2 )

where I and J are numeric variables (any variables, not just "indexed by"
ones).
In the parser we need a "struct sym" to reference it in a call to

	gen_move( struct sym *sy_src, struct sym *sy_dst )

where sy_src is the source of the moved field (can be a literal too, of
course), and sy_dst is the destination variable.  The definition of this
function is very simple indeed:

void gen_move( struct sym *sy_src, struct sym *sy_dst ) {
	gen_loadvar( sy_dst );
	gen_loadvar( sy_src );
	asm_call("move");
}

This generates code to push the representation of the two references onto
the stack and to call the runtime library function "move".  The hard work is
done by the function gen_loadvar.  It first inspects the litflag of the
argument it receives to see whether it is really a symbol (struct sym), a
literal (struct lit) or a subscripted/indexed variable reference (struct
vref), and decides what to do depending on its value.
The result is the generation of code to push two values (unless the
reference is NULL) onto the runtime stack:
(1) the "struct fld_desc" of the field; (2) a pointer (char *) to the field
storage. (Please see also the section on code generation below.)

Returning to subscripting/indexing: when gen_loadvar finds (actually in
gen_loadloc) a litflag=2, meaning that the pointer really refers to a "struct
vref", it calls gen_subscripted to generate the code that computes the offset
of the array element, following the list of variable references in the vref.
In the example given above, VAR ( I + 1, J - 2 ) will be represented as a list
with the following values:

(Notes: the headers are the fields of "struct vref"
        the addresses are fictitious)

address	| litflag | next  | sym->name
--------+---------+-------+-----------
80001   |  '\x2'  | 80012 | VAR
80012   |  '+'    | 80035 | I
80035   |  ','    | 80047 | 1
80047   |  '-'    | 80059 | J
80059   |  ','    | NULL  | 2

Another related function is value_to_eax, which generates code for loading
the value of a subscript variable (I or J above) into the register %eax. This
function will be extended to include variables with the "usage is comp" clause
(for working with real indices, not only subscripts).
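
To make the list layout in the table above more concrete, here is a small
sketch that walks such a list and evaluates each subscript expression.  It is
an illustration only: the real gen_subscripted emits assembler instead of
computing values, and the structure below (with a "value" field standing in
for the run-time value of each operand) is hypothetical.  The sketch also
assumes that the litflag of a node is the operator that follows its operand,
with ',' closing the current subscript, which is how the table reads.

```c
#include <stddef.h>

struct vref_sketch {
    char litflag;               /* 2 for the head; '+', '-' or ',' after it */
    struct vref_sketch *next;
    const char *name;           /* stands in for sym->name */
    int value;                  /* stand-in for the operand's run-time value */
};

/* Evaluate every subscript of the reference starting at head.
   Stores at most max results in out[] and returns the number of
   subscripts found in the list. */
int eval_subscripts(const struct vref_sketch *head, int *out, int max)
{
    const struct vref_sketch *p;
    int n = 0;                  /* subscripts completed so far */
    int acc = 0;                /* value of the current expression */
    char op = '+';              /* operator applied to the next operand */

    for (p = head->next; p != NULL; p = p->next) {
        acc = (op == '-') ? acc - p->value : acc + p->value;
        if (p->litflag == ',') {        /* end of this subscript */
            if (n < max)
                out[n] = acc;
            n++;
            acc = 0;
            op = '+';
        } else {
            op = p->litflag;            /* '+' or '-' before next operand */
        }
    }
    return n;
}
```

For VAR ( I + 1, J - 2 ) with I = 3 and J = 7, the walk yields the two
subscripts 4 and 5.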


--- more to be added later ---




The parser
----------

--- to be written  (any takers?) ---






Code generation
---------------

Our output is assembly language, with C conventions for passing arguments to
functions.  In COBOL, we cannot handle null-terminated strings as we do in C,
because a character may hold any binary value (including 0x00, the
null char), and the fields in COBOL are fixed length.  Therefore, most of the
library functions require a description of the field that is to be
altered in any way.  Each COBOL variable is represented by two pointers:

* a fld_desc (field descriptor) pointer

* a storage pointer (the real buffer contents)

When we finally implement the "comp" data type (and also the "float", "pointer"
and other data types), we may see other layouts for this argument passing.
Here is a description of each entry:

struct fld_desc {
   unsigned short len;
   char type;
   unsigned char decimals;
   unsigned char all;
   char *pic;
};
					
This is a static structure (seen by the library code), and it should never be
changed by library code. Its components describe the variable pointed to by
the second argument: "len" is the length of the field in bytes, "type" is the
field type ('G' for groups, 'C' for comp-3 numeric, '9' for elementary numeric
fields, ...). There's a short (but incomplete) description of them in another
info/*.txt file.
The component "decimals" is, for numeric fields, the number of decimal
positions after the "V" assumed decimal point, if positive.
If negative, it is the number of "P"s to the left.  In other words, it is the
scaling of the numeric variable.
The component "all" is a flag that records whether the 'ALL' option was given
(for literals); in that case the value is repeated as required by a move
operation (wrap-around at the end).
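
As an illustration of how a library function can use the descriptor/buffer
pair, here is a sketch of a much simplified alphanumeric move.  The function
move_sketch() is hypothetical and is not the runtime "move": it ignores
types, pictures and numeric conversion, and only shows the fixed-length
handling and the wrap-around triggered by the "all" flag.

```c
#include <stddef.h>

struct fld_desc {
    unsigned short len;
    char type;
    unsigned char decimals;
    unsigned char all;
    char *pic;
};

/* Copy the source field into the destination field.  Lengths come from
   the descriptors, never from a terminating null.  A short source is
   repeated when its "all" flag is set, otherwise the destination is
   padded with spaces. */
void move_sketch(const struct fld_desc *sd, const char *src,
                 const struct fld_desc *dd, char *dst)
{
    size_t i;

    for (i = 0; i < dd->len; i++) {
        if (i < sd->len)
            dst[i] = src[i];
        else if (sd->all && sd->len > 0)
            dst[i] = src[i % sd->len];      /* wrap around the source */
        else
            dst[i] = ' ';                   /* pad with spaces */
    }
}
```

Moving the two-byte literal "AB" with the "all" flag set into a five-byte
field gives "ABABA"; without the flag it gives "AB" followed by three spaces.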

Why not simply put the pointer to the variable's buffer in its descriptor?
This wouldn't work, because variables may be passed to sub-programs (calls to
another COBOL program) and their storage is defined in a stack frame, so it is
very volatile when externally linked.

Compressed (packed) fields and signs are stored as IBM does in its compilers,
with the sign at the rightmost position (I'm not sure now, please correct me
if I'm wrong) and all digits BCD-coded.
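
For reference, the IBM packed-decimal layout keeps two BCD digits per byte
and puts the sign in the low-order nibble of the last byte (0xC for plus,
0xD for minus, 0xF for unsigned).  The sketch below decodes such a field.
Whether our runtime follows this layout in every detail is not confirmed
here, and the function name unpack_comp3() is hypothetical.

```c
#include <stddef.h>

/* Decode an IBM-style packed-decimal field of len bytes into a long.
   No overflow or digit-validity checks are made in this sketch. */
long unpack_comp3(const unsigned char *buf, size_t len)
{
    long v = 0;
    size_t i;
    unsigned sign;

    if (len == 0)
        return 0;
    for (i = 0; i < len; i++) {
        v = v * 10 + (buf[i] >> 4);             /* high nibble: a digit */
        if (i + 1 < len)
            v = v * 10 + (buf[i] & 0x0F);       /* low nibble: a digit  */
    }
    sign = buf[len - 1] & 0x0Fu;                /* last low nibble: sign */
    return (sign == 0x0D || sign == 0x0B) ? -v : v;
}
```

So the two bytes 0x12 0x3C hold +123 and the bytes 0x12 0x3D hold -123.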

Files are a different matter, because they need different information. Please
look at the "struct file_desc" (htcoblib.h) to see its components.

-- to be better described later --




Notes on interfacing with the library functions
-----------------------------------------------


Let us look at how the code for a typical library function call is generated.

In the parser, we detect the COBOL verb ADD and its arguments:

	statement:
	 ...
	 | ADD { }
			gname req_to     { $<ival>$=ADD; }
			var_list
	 ...			 

Here "gname" is a non-terminal describing any variable name, literal, or
certain figurative constants; "req_to" is a non-terminal that ensures a TO was
detected (it's not simply TO, because of the minor codes mentioned in the
scanner section); the action { $<ival>$=ADD; } sets the stacked value of
this action to the token code ADD, so that several statements can share
the same productions in "var_list"; finally, "var_list" is _the_ code
generating production.  Let's see how it works:

	var_list:
		var_list opt_sep gname
		...
		else if ($<ival>0 == ADD)
				   gen_add($<sval>-2,$<sval>3);
		...

It's a recursive declaration that generates an ADD instruction for each
variable found in the list.  For instance, suppose we are parsing:

    ADD 1 TO VAR-1  VAR-2  VAR-3

this will generate the same code as if we had written instead:

	ADD 1 TO VAR-1
	ADD 1 TO VAR-2
	ADD 1 TO VAR-3

Of course, this could be optimized considerably, but let's keep things simple
for now.
	The test (if condition) ($<ival>0 == ADD) tells us whether this is really
the ADD statement (not MOVE, nor SUBTRACT, ...), because it looks at the value
one position below the start of the current rule on the yacc stack. This is
called an "inherited attribute" in compiler theory. We are really looking at
the action value discussed above.  The value must be typed with $<ival>$
because an action cannot be given a type the way other non-terminals can (it's
typeless), yet it shares the same stack space as all other terminals and
non-terminals, as defined by the %union yacc statement. Please consult a good
compiler book to understand this better.

Now we need another inherited attribute to access our left-hand variable
(before the action and before "req_to" in the ADD production). Counting
back, we find its value -2 stack positions away, which is why the first
argument to gen_add() is $<sval>-2.  The other argument is the right-hand
variable we are currently parsing, $3.  Here there is no need to type it,
because it's a known non-terminal of type "sval" (for "symbol value").
BTW, "ival" means "integer value".  See the %union statement at the
beginning of htcobol.y to get the full picture.

At the code generation side, we have the following code-generating function:

	void gen_add( struct sym *s1, struct sym *s2 ) {
	   gen_loadvar( s2 );
	   gen_loadvar( s1 );
	   asm_call("add");
	}
			
The function gen_loadvar() generates the code for pushing the "struct fld_desc
*" and the "char *" (the buffer) of the given variable (s2 or s1).
Remember that in the C calling convention the first argument must be at the
top of the stack, so we push the arguments in reverse order.  Each variable
occupies 8 bytes of the stack, as discussed before (2 pointers).  The
asm_call() function generates the code for calling the library function and
takes care of cleaning the stack.
This automatic cleaning is only possible if you don't write code that pushes
values manually (like fprintf(o_src,"\tpushl\t%%eax\n") for pushing the %eax
register), as that leaves the counter with the wrong value.  Use the function
push_eax() instead.  Please look for this section at htcoblib.c (search for
push_eax and look around!).
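
The bookkeeping behind this rule can be sketched as below.  The names ending
in _sketch, the counter variable and the exact instructions emitted are
assumptions for illustration (32-bit x86, 4 bytes per push); the real
push_eax() and asm_call() may differ in detail.

```c
#include <stdio.h>

static unsigned stack_bytes;    /* bytes pushed since the last call */

/* Emit a push of %eax and record it in the counter. */
void push_eax_sketch(FILE *o_src)
{
    fprintf(o_src, "\tpushl\t%%eax\n");
    stack_bytes += 4;
}

/* Emit the call, then pop everything that was pushed for it. */
void asm_call_sketch(FILE *o_src, const char *name)
{
    fprintf(o_src, "\tcall\t%s\n", name);
    if (stack_bytes != 0)
        fprintf(o_src, "\taddl\t$%u, %%esp\n", stack_bytes);
    stack_bytes = 0;
}

/* Only for inspection: how many bytes are waiting to be cleaned. */
unsigned pending_stack_bytes(void)
{
    return stack_bytes;
}
```

A push written directly with fprintf() would bypass the counter, so the
"addl" emitted after the call would remove too few bytes and leave the
stack unbalanced.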


-- I'll write more later. Please be patient. ...or write it yourself! --





Some random notes
-----------------

As we work within a very heterogeneous group, we have to ensure that the
compiler is always usable (runnable). Otherwise, developers working
on another part of the compiler or run-time library would not be able to
check in their implementations to the CVS server.

So if you want to make a large number of changes that will leave the compiler
temporarily unusable, please create a new branch on your computer, and do your
changes and tests there. But please, don't update the main development branch
with unusable code. Read the CVS manual for more information.

Our first rule is: "the compiler must compile at all times!"


Rildo Pragana

Modified by: David Essex
