Introduction

Introduction to Compilers and Runtime Systems

Language Processors

Compilers read a program written in a source language and translate it into an equivalent program in another language.

  • The target language is typically machine language, which executes much faster than an interpreted program; interpreters, however, usually provide better error diagnostics because they execute the source program statement by statement.

Java language processors combine compilation and interpretation. The source program is compiled into bytecode (an intermediate form), which a virtual machine interprets.

Structure of a Compiler

At a high level, a compiler is divided into two parts: Analysis and Synthesis. Analysis, sometimes called the compiler front-end, breaks the source code into constituent pieces and imposes a grammatical structure on them. This structure is used to create an intermediate representation of the source program.

  • If the analysis part determines that the source program is grammatically or semantically incorrect, it will inform the user (developer) of such errors.

The analysis phase also collects information about the source program and populates the symbol table.

The intermediate representation and symbol table are passed to the compiler's synthesis part, sometimes called the compiler back-end.

Synthesis

The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table.

Structure conclusion

Most compilers comprise several phases arranged in a pipeline; each phase transforms the output of the preceding phase into a form suitable for the next. The symbol table is used by all phases. Some compilers have a machine-independent optimization phase between the front end and the back end to further transform the IR, allowing the back end to produce a better target program.

Phases

Lexical Analysis

The compiler's first phase is lexical analysis or scanning. The lexical analyzer reads the stream of characters from the source program and groups them into meaningful sequences called lexemes.

  • For each lexeme, the analyzer produces as output a token of the form

    • <token-name, attribute-value>

The token-name is an abstract symbol that is used during syntax analysis, and the attribute-value points to information in the symbol table that is needed for semantic analysis.

Example

Suppose that the source program has the following statement:

position = initial + rate * 60

The lexemes in this statement are mapped to the following tokens.

  1. position is a lexeme that would be mapped into the token <id, 1>, where id is an abstract symbol (used for syntax analysis) that stands for identifier, and 1 points to the symbol table entry for position (used for semantic analysis). The symbol table entry for an identifier holds information about the identifier, such as its name and type.

  2. The assignment symbol ( = ) is a lexeme that is mapped into the token <=>, which needs no attribute value. Any abstract symbol could be used for the token-name (e.g., assign).

  3. initial is a lexeme that is mapped into the token <id, 2>, indicating it is another identifier whose attribute points to the second entry in the symbol table (the entry for initial).

  4. + is a lexeme that is mapped into the token <+>.

  5. rate is a lexeme that is mapped into the token <id, 3>, where 3 points to the symbol-table entry for rate.

  6. * is a lexeme that is mapped into the token <*>.

  7. 60 is a lexeme that is mapped into the token <60>.

Blanks separating the lexemes are discarded by the lexical analyzer. After lexical analysis, the statement is represented as the following sequence of tokens:

<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>

The token names =, +, and * are abstract symbols for the assignment, addition, and multiplication operators, respectively.
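As a minimal sketch (the enum, struct, and field names here are mine, not from any particular compiler), a lexer might represent this token stream in C as follows:

#include <stdio.h>

/* Abstract token names for the example statement. */
enum TokenName { TOK_ID, TOK_ASSIGN, TOK_PLUS, TOK_STAR, TOK_NUM };

/* A token pairs an abstract name with an optional attribute:
   a symbol-table index for TOK_ID, the literal value for TOK_NUM. */
struct Token {
    enum TokenName name;
    int attribute; /* unused for operators like = + * */
};

int main(void)
{
    /* Token stream for: position = initial + rate * 60 */
    struct Token stream[] = {
        { TOK_ID, 1 },     /* <id, 1>  (position) */
        { TOK_ASSIGN, 0 }, /* <=>                 */
        { TOK_ID, 2 },     /* <id, 2>  (initial)  */
        { TOK_PLUS, 0 },   /* <+>                 */
        { TOK_ID, 3 },     /* <id, 3>  (rate)     */
        { TOK_STAR, 0 },   /* <*>                 */
        { TOK_NUM, 60 },   /* <60>                */
    };
    printf("%zu tokens\n", sizeof stream / sizeof stream[0]);
    return 0;
}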

Syntax Analysis

The second phase of the compiler is syntax analysis or parsing. The parser uses the first components of the tokens (the token names) to create a tree-shaped intermediate representation, called a syntax tree, that shows the grammatical structure of the token stream.

A typical syntax tree uses operators as interior nodes, and the children of a node represent the arguments of the operation.
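As a hedged sketch (the node layout and helper names are hypothetical, not from a real compiler), the syntax tree for the running example could be built and printed like this:

#include <stdio.h>
#include <stdlib.h>

/* An interior node holds an operator; leaves hold source text. */
struct Node {
    char op;            /* '=', '+', '*', or 0 for a leaf */
    const char *text;   /* leaf text when op == 0 */
    struct Node *left, *right;
};

static struct Node *node(char op, struct Node *l, struct Node *r)
{
    struct Node *n = malloc(sizeof *n);
    n->op = op; n->text = NULL; n->left = l; n->right = r;
    return n;
}

static struct Node *leaf(const char *text)
{
    struct Node *n = node(0, NULL, NULL);
    n->text = text;
    return n;
}

/* Print the tree in fully parenthesized form. */
static void show(const struct Node *n)
{
    if (n->op == 0) { printf("%s", n->text); return; }
    printf("(");
    show(n->left);
    printf(" %c ", n->op);
    show(n->right);
    printf(")");
}

int main(void)
{
    /* position = initial + rate * 60
       '*' binds tighter than '+', so it sits lower in the tree. */
    struct Node *tree =
        node('=', leaf("position"),
                  node('+', leaf("initial"),
                            node('*', leaf("rate"), leaf("60"))));
    show(tree);
    printf("\n"); /* prints: (position = (initial + (rate * 60))) */
    return 0;
}

Walking this tree bottom-up is exactly what the later phases do: the multiplication node is visited before the addition that consumes its result.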

Semantic Analysis

The semantic analyzer uses the syntax tree intermediate representation provided by the syntax analyzer, along with information from the symbol table, to check the program for semantic consistency. It also gathers type information and saves it in either the syntax tree or the symbol table.

Type checking is one of the primary responsibilities of the semantic analyzer. It must ensure that each operator has operands of the types the operator permits.

  • Example 1: An array index must be an integer, so myarray[1.0] should produce an error because the index expression has a floating-point type.

Sometimes coercion is allowed, where the compiler converts a value from one type to another. For example, a binary arithmetic operator may be applied to a floating-point number and an integer; in that case, the compiler may coerce, or convert, the integer to a floating-point number, as sketched after this list.

  • Example 2: In the code above, assume position, rate, and initial are declared as floating-point; the semantic analyzer then converts 60 to the floating-point number 60.0.

    • The semantic analyzer adds an extra interior node, inttofloat, which converts the integer 60 to a float 60.0.
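A minimal sketch of how coercion insertion might work, assuming a toy expression node with a type tag (all type names and helpers here are hypothetical):

#include <stdlib.h>

enum Type { TY_INT, TY_FLOAT };
enum Kind { K_LEAF, K_BINOP, K_INTTOFLOAT };

struct Expr {
    enum Kind kind;
    enum Type type;
    struct Expr *left, *right; /* K_INTTOFLOAT uses left only */
};

/* Wrap an integer-typed subtree in an inttofloat node. */
static struct Expr *int_to_float(struct Expr *e)
{
    struct Expr *n = calloc(1, sizeof *n);
    n->kind = K_INTTOFLOAT;
    n->type = TY_FLOAT;
    n->left = e;
    return n;
}

/* If exactly one operand of a binary operator is an integer and the
   other a float, coerce the integer side so the types match. */
static void coerce_operands(struct Expr *binop)
{
    if (binop->left->type == TY_FLOAT && binop->right->type == TY_INT)
        binop->right = int_to_float(binop->right);
    else if (binop->left->type == TY_INT && binop->right->type == TY_FLOAT)
        binop->left = int_to_float(binop->left);
    binop->type = binop->left->type;
}

int main(void)
{
    /* rate * 60: rate is a float leaf, 60 an integer leaf. */
    struct Expr rate  = { K_LEAF, TY_FLOAT, 0, 0 };
    struct Expr sixty = { K_LEAF, TY_INT, 0, 0 };
    struct Expr mul   = { K_BINOP, TY_INT, &rate, &sixty };
    coerce_operands(&mul); /* mul.right is now an inttofloat node */
    return mul.type == TY_FLOAT ? 0 : 1;
}

The coercion becomes an explicit node in the tree, which later phases can turn into a conversion instruction (the inttofloat seen in the intermediate code below).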

Intermediate Code Generation

In this optional but frequently used phase, the compiler translates the tree produced and modified by the syntax and semantic analyzers, respectively, into a machine-like, low-level representation.

Two key properties of an intermediate representation:

  1. Should be easy to produce

  2. Should be easy to translate into the target machine language.

In the running example, an intermediate form known as three-address code is used, where each instruction has at most three operands. Furthermore, the right side of each instruction contains at most one operator, which fixes the order in which operations are performed.

Some instructions in three-address intermediate form will use fewer than three operands.
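Following the classic presentation of this running example (a sketch: id1, id2, and id3 stand for the symbol-table entries for position, initial, and rate, and t1 - t3 are compiler-generated temporaries), the three-address code might look like this; note that the first and last instructions use fewer than three operands:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3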

Code Optimizer

A machine-independent code-optimization phase attempts to improve the intermediate code so that better target code can be produced. In most cases, better means faster; however, other objectives, such as shorter or more power-efficient code, also exist.

In most cases, a simple and straightforward intermediate code generator followed by a code optimizer is a good way to produce well-optimized target code.

The Rust compiler uses three IR phases: HIR, THIR, and MIR. (Rust compiler documentation).

In the running example, the code optimizer removes inttofloat() by simply replacing the integer 60 with the floating-point 60.0, and it eliminates the unnecessary temporary t3. Both are examples of optimizations that shorten the code, as sketched below.
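A sketch of what the optimizer might leave behind for the running example:

t1 = id3 * 60.0
id1 = id2 + t1

A second example optimization, this time starting from C source: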

int main(void)
{
    int a = 0, b = 10, c = 20;

    int d = c * 2;
    a = b + d;
    
    return a;
}

The code above has a few opportunities for optimization. First, multiplying a number by 2 is the same as shifting it left by 1 bit (e.g., 20 * 2 is the same as 20 << 1). On an ARM chip, the barrel shifter can preprocess this operation before the ALU is used, via a logical left shift (e.g., LSL #1). Another available optimization is the removal of the variables d and a:

int main(void)
{
    int b = 10, c = 20;
    return b + c * 2;
}

The optimized assembly might look like this:

; ARM Optimized Assembly

main:
        sub     sp, sp, #16
        str     wzr, [sp, #12]
        mov     w8, #10
        str     w8, [sp, #8]
        mov     w8, #20
        str     w8, [sp, #4]
        ldr     w8, [sp, #8]
        ldr     w9, [sp, #4]
        add     w0, w8, w9, lsl #1
        add     sp, sp, #16
        ret

ARM AArch64 has 31 general-purpose 64-bit registers, X0 - X30. The lower 32 bits of each are accessible as W0 - W30. There is no register named X31 or W31; instead, when register number 31 is used as the stack pointer it is referenced as SP, and when it is used as the zero register it is referenced as XZR (WZR in its 32-bit form).

The assembly above shows the logical-left-shift optimization used with an add instruction. First, 16 bytes of room are made on the stack; a is zeroed via the zero register (str wzr, [sp, #12]), and the values 10 and 20 are stored in the slots for b and c, respectively, using register w8:

  • mov w8, #10; str w8, [sp, #8] // store w8 (holding 10) at sp + offset 8

  • mov w8, #20; str w8, [sp, #4] // store w8 (holding 20) at sp + offset 4

Next, both values are loaded:

  • ldr w8, [sp, #8]

  • ldr w9, [sp, #4]

Finally, the add is performed with the optimization applied. Let's walk through the compiler phases.

  • Lexical Analysis:

    • Statement: return b + c * 2

    • Lexemes: b, +, c, *, 2

    • Tokenized statement: <id, 1> <+> <id, 2> <*> <2>

  • Syntax Analysis: a syntax tree with + at the root; its left child is <id, 1> and its right child is a * node whose children are <id, 2> and 2

  • Semantic Analysis: none needed (all operands are integers, so no coercion is required)

  • Intermediate Code Generation:

    • t1 = id2 * 2

    • t2 = id1 + t1

  • Machine-Independent Code Optimization

    • Using a logical left shift saves CPU cycles

    • t1 = id2 << 1

    • t2 = id1 + t1

  • Architecture-specific Optimization

    • t2 = id1 + id2 << 1

  • Code Generator

    • add w0, w8, w9, lsl #1

This may not be a perfect walkthrough, but it illustrates the point. The code optimizer recognized a logical left shift as a more efficient way of multiplying 20 ( <id, 2> ) by 2, since shifting uses fewer CPU cycles.

The arch-specific optimizer combined everything into one instruction, knowing that the barrel shifter can perform inline shifting.

Finally, the generated code is an ARM add instruction that performs inline shifting using the barrel shifter. The value in w9 (variable c, holding 20) is shifted left by 1 bit inline; that is, the barrel shifter performs the shift before the operand reaches the ALU. The operand the ALU receives is therefore 40, not 20 (w9 itself still holds 20). A logical left shift by one bit is the same as multiplying by 2.

Next, the add operation adds the shifted w9 [40] to w8 [10] and stores the result in general-purpose register w0. Under the AArch64 procedure call standard, X0 holds the first integer argument and also the integer return value; the same applies to its low 32-bit form, W0. Storing the result in w0 therefore means the value will be returned. There is no extra store to a named variable (I optimized that out by hand, but the optimizer would have done the same).

The instruction add sp, sp, #16 releases the stack space, and ret returns the result of b + c * 2 [10 + 20 * 2], or 50.
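As a quick sanity check (a trivial program of my own, not compiler output), the strength-reduced expression computes the same value as the original:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    int b = 10, c = 20;

    /* The shifted form must equal the multiplied form. */
    assert(b + (c << 1) == b + c * 2);
    printf("%d\n", b + (c << 1)); /* prints 50 */
    return 0;
}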

Code Generation

The code generator takes as input the (potentially optimized) intermediate representation of the source code and maps it to the target language. If machine code is used, the compiler will also select registers or memory locations for the live variables.

The intermediate instructions are translated into machine instructions that perform the same task. The code generator is responsible for assigning registers, which can be quite challenging: register allocation via graph coloring is NP-complete (graph coloring itself is NP-complete, and register allocation can be modeled as coloring the interference graph).
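A toy sketch of the graph-coloring idea (entirely illustrative; real allocators add spilling, coalescing, and much better heuristics): each node is a virtual register, an edge means two values are live at the same time, and each color stands for a physical register.

#include <stdio.h>

#define N 4 /* number of virtual registers (nodes) */

int main(void)
{
    /* interference[i][j] == 1 means values i and j are live at the
       same time and cannot share a physical register. */
    int interference[N][N] = {
        {0, 1, 1, 0},
        {1, 0, 1, 0},
        {1, 1, 0, 1},
        {0, 0, 1, 0},
    };
    int color[N];

    /* Greedily give each node the lowest color unused by any
       already-colored neighbor. */
    for (int i = 0; i < N; i++) {
        int used[N] = {0};
        for (int j = 0; j < i; j++)
            if (interference[i][j])
                used[color[j]] = 1;
        int c = 0;
        while (used[c])
            c++;
        color[i] = c;
    }

    for (int i = 0; i < N; i++)
        printf("v%d -> r%d\n", i, color[i]);
    return 0;
}

If more colors are needed than physical registers exist, a real allocator spills some values to the stack, which is where much of the engineering difficulty lies.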

Implementation

Activities from several phases can be grouped together into a pass that reads an input file and writes an output file. An example of this would be grouping lexical, syntax, and semantic analysis together.

Tools

  • Parser Generator

    • Automatically produces syntax analyzers from a grammatical description of a programming language.

  • Scanner Generator

    • Produces lexical analyzers from a regular-expression description of the tokens of a language.

  • Syntax-directed Translation Engines

    • Produce collections of routines for walking a parse tree and generating intermediate code.

  • Code-Generator Generators

    • Produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language for a target machine.

  • Data-Flow Analysis Engines

    • Facilitate the gathering of information about how values are transmitted from one part of a program to each other part.

  • Compiler-Construction Toolkits

    • Provide an integrated set of routines for constructing various phases of a compiler.
