Concepts and Semantics of Programming Languages 1. Therese Hardin
object code produced by the compiler may be either binary code encoding instructions directly or source text in assembly code. In the latter case, a program – known as the assembler – must be called to transform this low-level source code into binary code. Generally speaking, assemblers simply produce a mechanical translation of instructions written mnemonically (mov, add, jmp, etc.) into binary representations. However, certain more sophisticated assemblers may also carry out optimization operations at this level.
Assembling mnemonic code into binary code is a very simple operation, which does not alter the structure of the program. The reference manual of the target CPU provides, for each instruction, the meaning of the bits of the corresponding binary word. For example, the reference manual for the MIPS32® architecture [MIP 13] describes the 32-bit binary format of the instruction ADD rd, rs, rt (with the effect rd ← rs + rt on the registers) as:
Figure 1.4. Coding the ADD instruction in MIPS32®
Three packets of 5 bits are reserved for encoding the register numbers (MIPS32® has 32 general-purpose registers); the other bits in this word are fixed and encode the instruction. The task of the assembler is to generate such bit patterns according to the instructions encountered in the source code.
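The bit layout described above can be sketched in Python as follows. This is a minimal illustration, not an assembler: the function name encode_add and the constant names are assumptions made for this example. The field widths follow the MIPS32® R-type format (a 6-bit opcode, three 5-bit register fields, a 5-bit field fixed to zero and a 6-bit function code).

```python
# Field layout of the 32-bit word, from bit 31 down to bit 0:
#   opcode (6) | rs (5) | rt (5) | rd (5) | zero (5) | funct (6)
OPCODE_SPECIAL = 0b000000  # fixed opcode bits used by ADD
FUNCT_ADD = 0b100000       # fixed function bits selecting ADD


def encode_add(rd: int, rs: int, rt: int) -> int:
    """Return the 32-bit word encoding ADD rd, rs, rt (hypothetical helper)."""
    for reg in (rd, rs, rt):
        if not 0 <= reg <= 31:
            raise ValueError(f"register number out of range: {reg}")
    return (
        (OPCODE_SPECIAL << 26)
        | (rs << 21)
        | (rt << 16)
        | (rd << 11)
        | (0 << 6)
        | FUNCT_ADD
    )


word = encode_add(rd=8, rs=9, rt=10)  # ADD $8, $9, $10
print(f"{word:032b}")                 # the bit pattern, field by field
print(f"0x{word:08X}")                # 0x012A4020
```

Only the three register fields vary from one ADD instruction to another; every other bit is a constant, which is why this translation is purely mechanical.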
1.2.3.5. Linking
A single program may be made up of several source files, compiled separately. Once the object code from each source file has been produced, all of this object code must be combined into a single executable file. Each object file includes “holes”, indicating information that was unknown at the moment when this object code was produced. Typically, the missing information is the location of the code of functions defined in a different compilation unit, or the location of variables defined outside the current unit.
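The idea of “holes” can be pictured with a toy model in Python. This is only a sketch under strong assumptions: real object files contain machine code and relocation tables, whereas here each hypothetical object file is reduced to the set of symbols it defines and the set of symbols it uses without defining them. The file names, symbol names and the function link are invented for the illustration.

```python
# Hypothetical, highly simplified object files: "needs" lists the holes.
object_files = {
    "main.o": {"defines": {"main"}, "needs": {"square", "print_int"}},
    "math.o": {"defines": {"square"}, "needs": set()},
    "io.o": {"defines": {"print_int"}, "needs": set()},
}


def link(objects):
    """Map every hole (file, symbol) to the file that defines the symbol."""
    definitions = {}
    for name, obj in objects.items():
        for symbol in obj["defines"]:
            if symbol in definitions:
                raise ValueError(f"{symbol} is defined more than once")
            definitions[symbol] = name

    resolved = {}
    for name, obj in objects.items():
        for symbol in obj["needs"]:
            if symbol not in definitions:
                # A hole that cannot be filled: no executable is produced.
                raise LookupError(f"undefined reference to {symbol} in {name}")
            resolved[(name, symbol)] = definitions[symbol]
    return resolved


resolved = link(object_files)
print(resolved[("main.o", "square")])  # math.o
```

If math.o were left out of the set, the call to link would fail with an undefined reference, which mirrors the requirement that every hole be filled before an executable file can be produced.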
The linker has to gather all the object files and fill all the holes. Evidently, for a set of object files to lead to an executable file, all holes must be filled; so the code of every function called in the source must be available. The linking process also has to integrate the needed code, if it comes from some libraries, whether from the standard language library or a third-party library. There is one final question to answer, concerning the point at which execution should begin. In certain languages (such as C, C++ and Java), the source code must contain one, and only one, special function, often named main, which is called to start the execution. In other languages (such as Python and OCaml), definitions are executed in the order in which they appear, defined by the file ordering during the linking process. Thus, “executing” the definition of a function does not call the function: instead, the “value” of this function is created and stored to be used later when the function is called. This means that programmers have to insert into the source file a call to the function which they consider to be the “starting point” of the execution. This call is usually the final instruction of the last source file processed by the linker.
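The behavior described for Python can be observed with a short script. The names trace, main and square are chosen for this illustration only; the name main has no special status in Python. Executing the two def statements merely creates and stores the function values, and nothing runs until the explicit call on the last line.

```python
trace = []


def main():
    # This body runs only when main() is called explicitly.
    trace.append("main called")
    return square(3)


def square(n):
    # Defined after main, but before main is actually called.
    return n * n


# At this point both definitions have been executed, but no function
# body has run yet.
trace.append("definitions executed")

# The explicit call acting as the starting point of the program.
result = main()
print(trace, result)
```

Note that main may refer to square even though square is defined later in the file: the name is looked up only when the body of main is executed, by which time both definitions have been processed.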
A simplified illustration of the different transformation passes involved in source code compilation is shown in Figure 1.5.
Figure 1.5. Compilation process
1.2.3.6. Interpretation and virtual machines
As we have seen, informally speaking, an interpreter “executes” a program directly from the AST. Furthermore, it was said that the code generation process may generate code for a virtual machine. In reality, interpreters rarely work directly on the tree; compilation to a virtual machine is often carried out as an intermediate stage. A virtual machine (VM) may be seen as a pseudo-microprocessor, with one or more stacks, registers and fairly high-level instructions. The code for a VM is often referred to as bytecode. In this case, compilation does not generate a file directly executable by the CPU. Execution is carried out by the virtual machine interpreter, a program supplied by the programming language environment. So, the difference between interpretation and compilation is not clear-cut.
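To make the notion of bytecode more concrete, here is a minimal sketch of a stack-based virtual machine interpreter in Python. The instruction set (PUSH, ADD, MUL) and the representation of bytecode as a list of pairs are assumptions made for this example; they do not correspond to the bytecode of any real VM.

```python
def run(bytecode):
    """Interpret a list of (opcode, argument) pairs using a single stack."""
    stack = []
    for opcode, arg in bytecode:
        if opcode == "PUSH":
            stack.append(arg)
        elif opcode == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == "MUL":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    # By convention in this sketch, the result is left on top of the stack.
    return stack.pop()


# Hypothetical bytecode that a compiler could emit for 2 * 5 + 1.
program = [
    ("PUSH", 2),
    ("PUSH", 5),
    ("MUL", None),
    ("PUSH", 1),
    ("ADD", None),
]
result = run(program)
print(result)  # 11
```

The list program plays the role of the compiled file: it is not executable by the CPU, and it is the function run, standing for the virtual machine interpreter, that carries out the execution.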
There are several advantages of using a VM: the compiler no longer needs to take the specificities of the CPU into account, the code is often more compact and portability is higher. As long as the executable file for the virtual machine interpreter is available on a computer, the bytecode can be run on that computer without being recompiled. The drawback to this approach is that the programs obtained in this way are often slower than programs compiled as “native” machine code.
2. Introduction to Semantics of Programming Languages
This chapter gives an intuitive introduction to the notions of name, environment, memory, etc., along with a first formal description of these notions. It allows readers to familiarize themselves with the semantic approach to programming that we share with a number of other authors [ACC 92, DOW 09, DOW 11, FRI 01, WIN 93].
Any high-level programming language uses names to denote the entities handled by programs. These names are generally known as identifiers, drawing attention to the fact that they are constructed in accordance with the syntactic rules of the chosen language. They may be used to denote program-specific values or values computed during execution. They may also denote locations (i.e. addresses in the memory), in which case they are called mutable variables. Identifiers can also denote operators, functions, procedures, modules, objects, etc., according to the constructs present in the language. For example, pi is often used to denote an approximate value of π; + is also an identifier, denoting an addition operator and often placed between its two operands, i.e. in infix position, as in 2 + 3. The expression 2 * x + 1 uses the identifier x; to compute its value, we need to know the value denoted by x. Retrieving the value associated with a given identifier is a mechanism at the center of any high-level language. The semantics of a language provides a model of this mechanism, presented – in a simplified form – in section 2.1.
All the formal definitions of languages, instructions, algorithms, etc., given in the following are coded in both OCaml and Python. The code paraphrases these definitions as closely as possible and is kept very similar in the two languages, even if developers used to these languages may find the resulting programming style rather unusual. For readers unfamiliar with these languages, very brief explanations are given alongside the code. Almost all features of OCaml and Python will be considered either in this first volume or in the second, where object-oriented programming is considered. We hope that these two encodings of formal notions can help readers who are not truly familiar with mathematical formalism.
2.1. Environment, memory and state
2.1.1. Evaluation environment
Let X be a set of identifiers and V a set of values. The association of an identifier x ∈ X with a value v ∈ V is called a binding (of the identifier to its value), and a set Env of bindings is called an execution environment or evaluation environment. Env(x) denotes the value associated with the identifier x in Env. The set of environments is denoted as E.
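As a first intuition, these definitions can be mimicked in a few lines of Python. The sketch below assumes that an environment is modeled by a Python dictionary from identifiers (strings) to values; this choice of representation, the function name lookup and the sample bindings are assumptions made for the illustration only.

```python
# A hypothetical environment with two bindings: x -> 5 and pi -> 3.14159.
env = {"x": 5, "pi": 3.14159}


def lookup(env, identifier):
    """Return Env(x), the value bound to the identifier in the environment."""
    if identifier not in env:
        raise KeyError(f"unbound identifier: {identifier}")
    return env[identifier]


# Computing the value of 2 * x + 1 requires the value denoted by x.
value = 2 * lookup(env, "x") + 1
print(value)  # 11
```

An identifier with no binding in the environment has no value, which this sketch reports as an error rather than inventing a default.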
In practice, the set of identifiers X that are actually used is finite: usually, we only consider those identifiers that appear in a program. An environment may thus be represented by a list of bindings, also called Env: