The Compiler
Introduction
The compiler is the primary tool of the system builder. It therefore plays a prominent role in the Oberon System, although it is not part of the basic system. Instead, it constitutes a tool module - an application - with a single command: Compile. It translates program texts into machine code. Therefore, it is as a program inherently machine-dependent; it acts as the interface between source language and target computer.
In order to understand the process of compilation, the reader needs to be familiar with the source language Oberon and with the target computer, the NS-32000 processor architecture. For both the reader is referred to the literature [1].
The language is defined as an infinite set of sequences of symbols taken from the language's vocabulary. It is described by a set of equations called syntax. Each equation defines a syntactic construct, or more precisely, the set of sequences of symbols belonging to that construct. It specifies how that construct is composed of other syntactic constructs. The meaning of programs is defined in terms of semantic rules governing each such construct.
Compilation of a program text proceeds by analyzing the text and thereby decomposing it recursively into its constructs according to the syntax. When a construct is identified, code is generated according to the semantic rule associated with the construct. The components of the identified construct supply parameters for the generated code.
It follows that we distinguish between two kinds of actions: analyzing steps and code generating steps. In a rough approximation we may say that the former are source language dependent and target computer independent, whereas the latter are source language independent and target computer dependent. Although reality is somewhat more complex, the module structure of this compiler clearly reflects this division. The tool module Compiler is primarily dedicated to syntactic analysis. Upon recognition of a syntactic construct, an appropriate procedure is called from one of the code generator modules.
Oberon program texts are regarded as sequences of symbols rather than sequences of characters. Symbols themselves, however, are sequences of characters. We refrain from explaining the reasons for this distinction, but mention that apart from special characters and pairs such as +, &, <=, also identifiers, numbers, and strings are classified as symbols. Furthermore, certain capital letter sequences are symbols, such as IF, END, etc. Each time the syntax analyzer (parser) proceeds to read the next symbol, it does this by calling procedure Get, which constitutes the so-called scanner residing in module OCS (for Oberon Compiler Scanner). It reads from the source text as many characters as are needed to recognize the next symbol.
In passing we note that the scanner alone reflects the definition of symbols in terms of characters, whereas the parser is based on the notion of symbols only. The scanner implements the abstraction of symbols. The recognition of symbols within a character sequence is called lexical analysis.
Ideally the recognition of any syntactic construct, say A, consisting of sub constructs, say B1, B2, ... , Bn, leads to the generation of code that depends only on (1) the semantic rules associated with A, and (2) on (attributes of) B1, B2, ... , Bn. If this condition is satisfied, the construct is said to be context-free, and if all constructs of a language are context-free, then also the language is context-free. Syntax and semantics of Oberon adhere to this rule, although with a significant exception. This exception is embodied by the notion of declarations. The declaration of an identifier, say x, attaches permanent properties to x, such as the fact that x denotes a variable and that its type is T. These properties are "invisible" when parsing a statement containing x, because the declaration of x is not also part of the statement. The "meaning" of identifiers is thus inherently context-dependent.
Context-dependence due to declarations is the immediate reason for the use of a global data structure which represents the declared identifiers and their properties (attributes). Since this concept stems from early assemblers where identifiers (then called symbols) were registered in a linear table, the term symbol table tends to persist for this structure, although in this compiler it is considerably more complex than an array. Basically, it grows during the processing of declarations, and it is searched while expressions and statements are processed. Procedures for building and for searching are contained in module OCT.
A complication arises from the notion of exports and imports in Oberon. Its consequence is that the declaration of an identifier x may be in a module, say M, different from where x is referenced. If x is exported, the compiler includes x together with its attributes in the symbol file of the compiled module M. When compiling another module which imports M, that symbol file is read and its data are incorporated into the symbol table. Procedures for reading and writing symbol files are contained in module OCT, and no other module relies on information about the structure of symbol files.
The syntax is precisely and rigorously defined by a small set of syntactic equations. As a result, the parser is a reasonably perspicuous and short program. Unfortunately, the target computer's instruction set is complex, and as a result the program for generating code is much longer and more difficult to comprehend. This is particularly pronounced in the case of a CISC computer such as the NS-32000. Nevertheless, its instruction set is comparatively regular.
Unlike the parser which is fully contained in a single module, code generating procedures are distributed over three modules with the goal of keeping their sizes within reasonable limits (1000 lines). Procedures within module OCE are called mainly when parsing expressions. Apart from generating the corresponding code, the procedures perform the checks for type consistency of operands, and they compute the attributes of the processed construct. As they select the appropriate instructions, they directly reflect the instruction set of the target computer. Procedures in module OCH are of the same nature; they are primarily called when parsing statements instead of expressions.
The final production of code is performed by procedures in module OCC. They are typically called from OCE and OCH. In analogy with the scanner transforming character sequences into symbols, OCC-procedures transform (abstract) instructions into sequences of bits. Hence, this module reflects the binary encoding of instructions, i.e. the target computer's instruction formats.
The resulting module structure of the compiler is shown in Fig 1 in a slightly simplified manner. In reality OCS is imported by all other modules due to their need for procedure OCS.Mark. This, however, will be explained later.
Code Patterns
Before it is possible to understand how code is generated, one needs to know which code is generated. In other words, we need to know the goal before we find the way leading to the goal. A fairly concise description of this goal is possible due to the structure of the language. As explained before, semantics are attached to each individual syntactic construct, independent of its context. Therefore, it suffices to list the expected code - instead of an abstract semantic rule - for each syntactic construct.
As a prerequisite to understanding the resulting instructions and in particular their parameters, we need to know where declared variables are stored, i.e. which are their addresses. This compiler uses the straight-forward scheme of sequential allocation of consecutively declared variables. An address is a pair consisting of a base address (in a register) and an offset. Global variables are allocated in the module's data section and the respective base address register is SB. Local variables are allocated in a procedure activation record on the stack; the respective base register is FP. Offsets are negative integers.
The amount of storage needed for a variable (called its size) is determined by the variable's type. The sizes of basic types are prescribed by the target computer's data representation. The following holds for the NS processor:
Type No. of bytes
SHORTINT, CHAR, BOOLEAN 1 INTEGER 2 LONGINT, REAL, SET, POINTER, PROCEDURE 4 LONGREAL 8
The size of an array is the size of the element type multiplied by the number of elements. The size of a record is the sum of the sizes of its fields.


LinkBack URL
About LinkBacks
Reply With Quote
LinkBacks Enabled by vBSEO
Bookmarks