January 25, 2011

Use of compilers and interpreters as language translators

Language is for communication. For two people to communicate, they need a common language. If no such language exists, they need a translator. A computer understands only electric pulses (digital signals). We represent these signals using the binary number system (0/1 for low/high). But writing a program in binary is very difficult (practically, not feasible). So for ease of use, short words are assigned to the common operations, and this mnemonic language is called assembly language. Still, writing big programs in assembly language is very difficult, and a program written for one machine may not work on other machines. So the need arose for a more generic, platform-independent, easy-to-use, simple-to-debug language, in which programs are also easy to maintain.

A translator program accepts a program written in a language that the computer cannot understand and translates it to the machine language of the target machine. The input program is known as the source code and the output program is known as the object code. So for a translator, the source code is a program written in a human-understandable language and the object code is a program in machine language.

In assembly language, we write programs using simple words called mnemonics. But computers cannot understand this language, since they work with electronic pulses. Here we need a translator that converts the mnemonics to the corresponding machine language, i.e. binary. This translator is called an ASSEMBLER. So an assembler is a translator program that accepts a program written in assembly language and produces an output file containing the corresponding machine language code of the target system. For an assembler, the source code is an assembly language program and the object code is a machine language program.
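
As a rough illustration of what an assembler does, here is a minimal sketch in C that translates one line of assembly into bytes. The instruction set (LOAD, ADD, STORE, HALT), the opcode values and the function name assemble_line are all made up for this example; a real assembler also handles labels, addressing modes and output file formats.

```c
#include <stdio.h>
#include <string.h>

/* Opcode table for a made-up 8-bit machine (hypothetical, for illustration). */
struct opcode { const char *mnemonic; unsigned char code; int has_operand; };

static const struct opcode OPCODES[] = {
    { "LOAD",  0x01, 1 },
    { "ADD",   0x02, 1 },
    { "STORE", 0x03, 1 },
    { "HALT",  0xFF, 0 },
};

/* Translate one line of assembly into machine code bytes.
   Returns the number of bytes written to out (1 or 2), or -1 on error. */
int assemble_line(const char *line, unsigned char out[2])
{
    char mnemonic[16];
    unsigned int operand = 0;
    int fields = sscanf(line, "%15s %u", mnemonic, &operand);
    size_t i;

    if (fields < 1)
        return -1;                       /* empty line */
    for (i = 0; i < sizeof OPCODES / sizeof OPCODES[0]; i++) {
        if (strcmp(mnemonic, OPCODES[i].mnemonic) != 0)
            continue;
        out[0] = OPCODES[i].code;
        if (!OPCODES[i].has_operand)
            return fields == 1 ? 1 : -1; /* HALT takes no operand */
        if (fields != 2 || operand > 255)
            return -1;                   /* operand missing or too large */
        out[1] = (unsigned char)operand;
        return 2;
    }
    return -1;                           /* unknown mnemonic */
}
```

With this table, the line "LOAD 10" becomes the two bytes 0x01 0x0A, and an unknown mnemonic such as "JUMP 3" is rejected.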

The next step in the evolution of computer languages is High Level Languages (HLL). These languages meet all the criteria explained in the first paragraph. In the early days, some computers were manufactured with a built-in facility to execute a high level language, e.g. the BBC Micro, which had an interpreter for a variant of the famous language BASIC built in. Many computers were released in this manner. These interpreters were used to execute programs written in BASIC, so on such machines HLL programs were run by an interpreter. Compilers are the other kind of translator, and some modern technologies use both compilers and interpreters.

In order to execute an HLL program, somebody has to translate the program for the computer. The interpreter does this job. It takes a program written in an HLL and parses it statement by statement; if the syntax of a statement is correct, the interpreter executes it immediately. So every time we need to execute the program, we first have to run the interpreter and give it the source file. The interpreter itself must be available in directly executable form, so that it can run without the help of other programs. An interpreter does not create any object code, and the source is required every time the program is executed. Since the program is executed statement by statement, the interpreter will start executing a program even if it contains an error somewhere in the middle. It stops only when it reaches the error. This strategy is good for debugging. But there is a very dangerous situation: a path of execution is missed during testing and the product is delivered. What can be done if that path contains a syntax error?
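
The behaviour described above (statements run one by one, and an error is noticed only when execution reaches it) can be sketched in C as follows. The toy language with "ADD n" and "SUB n" statements and the function name interpret are assumptions made for this example only.

```c
#include <stdio.h>

/* Execute a toy program one statement at a time.
   Statements (hypothetical): "ADD n" adds n to the accumulator,
   "SUB n" subtracts n from it.
   Returns -1 if every statement ran, otherwise the index of the statement
   with the syntax error. Statements before the error have already run. */
int interpret(const char *const program[], int count, long *acc)
{
    int i;
    for (i = 0; i < count; i++) {
        long n;
        if (sscanf(program[i], "ADD %ld", &n) == 1)
            *acc += n;
        else if (sscanf(program[i], "SUB %ld", &n) == 1)
            *acc -= n;
        else
            return i;   /* error found only when execution reaches it */
    }
    return -1;
}
```

For the program { "ADD 5", "SUB 2", "MUL 3", "ADD 1" } the first two statements are executed before the bad third statement is detected, so the accumulator has already changed when the interpreter stops.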

A compiler is used to convert an HLL program into a low level language (LLL) program. In this case, the translator creates object code and stores it permanently, and once linked, that file can be directly executed. The compiler is a program that accepts a source program written in an HLL and converts it into the corresponding LLL. Once we compile an HLL program, we get an executable program, and that program can be executed without the help of the compiler. So the compiler is required only for compilation, not for execution.

| source code |---->| Compiler |---->| Object code |

The linker is a program that adds libraries to the compiled program. After compilation we get object code, which requires further processing. Libraries are collections of common functions, e.g. input/output functions, math functions etc. These functions are compiled and stored in separate files. If we use these functions in our program, the compiler marks them as external symbols in the generated object code. At link time, the linker searches for these unresolved external symbols and tries to locate them in the libraries; if found, it adds the code to the object code and fixes up the call addresses (this is called resolving external symbols). If any unresolved symbol remains in the object code after linking, the linker reports an error and the executable is not created. If all the external symbols are resolved, the linker generates an output file, which is directly executable.
The above applies to static libraries. If the object code depends on a dynamic library, the linker marks those symbols separately and they are resolved at the time of execution. This is performed by the LOADER, which is part of the OS.
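
The idea of resolving external symbols can be sketched in C with two small tables. This is only a toy model, assuming a symbol is just a name plus an address; the struct and function names are invented for the example, and real object file formats are far more involved.

```c
#include <stddef.h>
#include <string.h>

/* A symbol is modelled as a name and an address (hypothetical layout). */
struct symbol { const char *name; unsigned long address; };

/* Look up each unresolved external in the library's symbol table and fill
   in its address. Returns the number of symbols that could not be resolved;
   a linker would refuse to create the executable if this is non-zero. */
int resolve_externals(struct symbol *externals, size_t n_ext,
                      const struct symbol *library, size_t n_lib)
{
    int unresolved = 0;
    size_t i, j;

    for (i = 0; i < n_ext; i++) {
        for (j = 0; j < n_lib; j++) {
            if (strcmp(externals[i].name, library[j].name) == 0) {
                externals[i].address = library[j].address;
                break;              /* symbol resolved */
            }
        }
        if (j == n_lib)
            unresolved++;           /* not found in the library */
    }
    return unresolved;
}
```

If the library table defines printf and sqrt, an external reference to sqrt gets its address filled in, while a reference to a function the library does not define is counted as unresolved.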

Header files in C
The header files in C or C++ contain function prototypes (e.g. printf/scanf in stdio.h; header files contain only function prototypes, not function definitions), user data type declarations (e.g. the FILE structure in stdio.h), constants (e.g. INT_MAX in limits.h), macros (e.g. and in iso646.h) etc. The function definitions are present in the library files.
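
To make the split between a header and a library concrete, here is a small sketch. The file names shapes.h and shapes.c and everything declared in them are hypothetical; the two files are shown in one listing only to keep the example short.

```c
/* shapes.h (hypothetical header): declarations only, no function bodies */
#ifndef SHAPES_H
#define SHAPES_H

#define MAX_SIDES 8                           /* constant */
#define SQUARE(x) ((x) * (x))                 /* macro */

typedef struct { int width, height; } Rect;   /* user data type */

int rect_area(const Rect *r);                 /* function prototype */

#endif /* SHAPES_H */

/* shapes.c (hypothetical): the function definition, which would normally be
   compiled separately and stored in a library file */
int rect_area(const Rect *r)
{
    return r->width * r->height;
}
```

A program that includes only the header part can be compiled, because the compiler knows the prototype of rect_area, but the call stays an unresolved external symbol until the linker finds the definition.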

The loader is the program that loads an executable program into memory for execution. Executables are relocatable: they can be loaded into any part of memory. The loader checks whether any dynamic libraries are required for the execution. If so, those libraries are also loaded into memory, whenever required.

Difference between Compilers and Interpreters
The main difference is that a compiler generates object code at the time of compilation, whereas an interpreter executes the program without generating any object code.
So each of these techniques has some advantages and disadvantages.

Compiler
1. Creates object code, so compilation is required only once.
2. Creates object code only if the source is error free.
3. Lists all the errors at the end of the compilation, and they have to be corrected one by one.
4. The generated code is platform dependent, so a separate compilation and executable are required for each platform.
5. The compiler and the source code are required only at the time of compilation. After that, the executable alone will work.

Interpreter
1. Executes the program without creating any object code, so the process of interpretation is required every time.
2. Executes the program even if it is erroneous or incomplete. The execution stops at the first error.
3. Reports an error only when that statement is executed.
4. If the source code is platform independent, an interpreter can execute the same program on any platform. Only the interpreter is platform dependent.
5. Both the source code and the interpreter are required every time the program is executed.

Virtual machines
This is a special case of translation. In this scenario, the source code is compiled into an intermediate language (IL). This can be the machine language of a hypothetical computer. The virtual machine (VM) interprets this intermediate language on the target machine. Since the intermediate language is not designed for a particular hardware implementation, it can be considered platform independent. The virtual machine takes care of the execution of this IL program. So the VM has to be different for each platform, and in the absence of the VM the IL program is not usable. A well-known example is Java.
Here the compiler and the interpreter are connected in series.

1. A compiler compiles the source and creates an intermediate language program.
2. An interpreter takes this IL program and executes it.

So this model illustrates the concepts of both the compiler and the interpreter.
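
The interpreter half of this arrangement can be sketched as a tiny stack machine in C. The instruction set (OP_PUSH, OP_ADD, OP_MUL, OP_HALT) and the function vm_run are invented for this illustration and have nothing to do with the actual Java bytecode.

```c
#include <stddef.h>

/* Instruction set of a made-up stack machine (hypothetical IL). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Interpret the IL program and return the value on top of the stack at HALT.
   On a malformed program, returns 0 and leaves *ok set to 0. */
int vm_run(const int *code, size_t len, int *ok)
{
    int stack[64];
    size_t sp = 0, pc = 0;

    *ok = 0;
    while (pc < len) {
        switch (code[pc++]) {
        case OP_PUSH:                       /* PUSH is followed by its operand */
            if (pc >= len || sp >= 64) return 0;
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:                        /* pop two values, push the sum */
            if (sp < 2) return 0;
            sp--; stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:                        /* pop two values, push the product */
            if (sp < 2) return 0;
            sp--; stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            if (sp < 1) return 0;
            *ok = 1;
            return stack[sp - 1];
        default:
            return 0;                       /* unknown opcode */
        }
    }
    return 0;                               /* ran off the end without HALT */
}
```

A compiler for the source expression (2 + 3) * 4 could emit the IL sequence PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT. The same sequence runs unchanged on any platform for which vm_run has been compiled, which is the sense in which the IL is platform independent.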

As I mentioned, the compiler is a program that generates object code from source code written in an HLL. Since it is a program, its design starts with the input we have to give and the output we should get. As you know, the input is source code written in the source language. In this case the output is an intermediate language program, so the intermediate language has to be defined. But before going into that, I would like to describe the design of a compiler.

The parser is the main module of a compiler. This module is responsible for checking the syntax and semantics of the source code. The parser calls the lexical analyzer to scan the source. The lexical analyzer returns the next token and lexeme to the parser. If the syntax of the current statement is correct, the parser instructs the code generator how to produce the code. The code generator produces the code (an IL representation) and returns it to the parser. Once this has been done for the complete source, an optimizer may be called. The optimizer analyses the IL code, improves its quality and generates the final code.

So the design of the compiler includes a lexical analyzer, a parser, a symbol table handler, a code generator (this is the stage where the IL comes into the picture) and an optimizer.

The first step is the lexical analyzer. To understand how it works, some knowledge of regular expressions and finite automata is required. It is very difficult to explain those concepts in passing, so I assume you have some basic understanding of them.

The lexical analyzer has to return a token and the corresponding lexeme. Generally, a token is a constant that represents the category of the word found in the source code, and the lexeme is the corresponding text. E.g. for a variable 'avariable', the token is a constant, something like IDENTIFIER, and the corresponding lexeme is the variable name 'avariable'.
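
A minimal sketch of such a scanner in C is given below. It recognises only identifiers and integer constants; the token names with the TOK_ prefix and the function next_token are assumptions for this example, not the final design. The caller must pass a lexeme buffer of at least one character.

```c
#include <ctype.h>
#include <stddef.h>

/* Token codes (illustrative names; only IDENTIFIER is mentioned in the text). */
enum token { TOK_EOF, TOK_IDENTIFIER, TOK_ICON, TOK_UNKNOWN };

/* Scan the next token starting at *src, copy its text into lexeme
   (truncated to size - 1 characters) and advance *src past it.
   Returns the token code. */
enum token next_token(const char **src, char *lexeme, size_t size)
{
    const char *p = *src;
    size_t n = 0;
    enum token tok;

    while (isspace((unsigned char)*p))
        p++;                                  /* skip white space */

    if (*p == '\0') {
        tok = TOK_EOF;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        tok = TOK_IDENTIFIER;                 /* letter (letter | digit)* */
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n + 1 < size) lexeme[n++] = *p;
            p++;
        }
    } else if (isdigit((unsigned char)*p)) {
        tok = TOK_ICON;                       /* digit+ */
        while (isdigit((unsigned char)*p)) {
            if (n + 1 < size) lexeme[n++] = *p;
            p++;
        }
    } else {
        tok = TOK_UNKNOWN;                    /* single unrecognised character */
        if (n + 1 < size) lexeme[n++] = *p;
        p++;
    }
    lexeme[n] = '\0';
    *src = p;
    return tok;
}
```

Called repeatedly on the input "avariable 42", it returns TOK_IDENTIFIER with the lexeme "avariable", then TOK_ICON with the lexeme "42", then TOK_EOF.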

The Lexical Analyzer
Functions
1. A main function to scan the source and update the current token and the corresponding lexeme. (LA_F1)
2. An interface function for the parser, which returns the current token. (LA_F2)
3. Another interface function for the parser, which updates the current token by calling LA_F1 and returns it to the parser. (LA_F3)
4. A function to decide whether a word is an identifier or a keyword. (LA_F4)
5. A comparison callback function for the search function bsearch. (LA_F5)

New functions will be added when required.

Variables and types
1. A variable to hold the current token. (LA_V1, type: integer)
2. A variable to hold the current lexeme corresponding to the token. (LA_V2, type: string)
3. A variable to hold the start of the current input/output buffer. (IO_V1, type: character pointer)
4. A variable to hold the end of the buffer. (IO_V2, type: character pointer)
5. A type to represent a keyword and the corresponding token. (LA_T1, type: structure)
6. A table to store all the keywords and the corresponding tokens. (LA_V3, type: array of LA_T1)
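
The keyword-related items of the lists above (LA_T1, LA_V3, LA_F4 and LA_F5) could look roughly like this in C. The keyword spellings, the token values and the C identifiers are assumptions made for the sketch; only the use of bsearch comes from the design.

```c
#include <stdlib.h>
#include <string.h>

/* Token codes (values and TOK_ prefix are assumptions for this sketch). */
enum { TOK_IDENTIFIER = 256, TOK_TYPE, TOK_INPUT, TOK_PRINT, TOK_IF,
       TOK_ELSE, TOK_SWITCH, TOK_CASE, TOK_DEFAULT, TOK_GOTO };

/* LA_T1: a keyword and its token. */
struct keyword { const char *word; int token; };

/* LA_V3: the keyword table. bsearch requires it to be sorted by word. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE },
    { "default", TOK_DEFAULT },
    { "else",    TOK_ELSE },
    { "goto",    TOK_GOTO },
    { "if",      TOK_IF },
    { "input",   TOK_INPUT },
    { "int",     TOK_TYPE },
    { "print",   TOK_PRINT },
    { "switch",  TOK_SWITCH },
};

/* LA_F5: comparison callback for bsearch. The first argument is the key
   (the word being looked up), the second is a table entry. */
static int compare_keyword(const void *key, const void *entry)
{
    return strcmp((const char *)key, ((const struct keyword *)entry)->word);
}

/* LA_F4: return the keyword's token, or TOK_IDENTIFIER if the word is not
   in the keyword table. */
int classify_word(const char *word)
{
    const struct keyword *k = bsearch(word, keywords,
                                      sizeof keywords / sizeof keywords[0],
                                      sizeof keywords[0], compare_keyword);
    return k ? k->token : TOK_IDENTIFIER;
}
```

A binary search keeps the lookup cheap even when the keyword list grows, at the price of having to keep the table sorted whenever a keyword is added.
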

The grammar for the language, written just before creating the parser.
This grammar has not been reviewed yet. It may change while writing the parser.

program : LB stmnts RB

stmnts : stmnts stmnt SEMI
| epsilon

stmnt : declstmnt
| asgnstmnt
| inputstmnt
| printstmnt
| ifstmnt
| switchstmnt
| GOTO label

declstmnt : TYPE varlist

varlist : varlist COMMA varinit
| varinit

varinit : varname
| varname ASSIGN constexprn

varname : IDENTIFIER

constexprn : ICON

asgnstmnt : varname ASSIGN expression
| varname ASSIGN SCON

expression : arithexpression
| relexpression
| logicexpression

arithexpression : arithexpression arithop unary
| unary

arithop : MULOP

unary : varname
| LP expression RP

relexpression : arithexpression RELOP arithexpression
| arithexpression

logicexpression : NOT relexpression
| relexpression AND relexpression
| relexpression OR relexpression

inputstmnt : INPUT varlist

printstmnt : PRINT varlist

ifstmnt : IF LP expression RP LB stmnts RB
| IF LP expression RP LB stmnts RB ELSE LB stmnts RB

switchstmnt : SWITCH LP expression RP LB switchbody RB

switchbody : casestmnts defaultstmnt
| casestmnts

casestmnts : CASE constexprn COLON LB stmnts RB

defaultstmnt : DEFAULT COLON LB stmnts RB
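
One possible way to turn such rules into code is a recursive descent parser, sketched here in C for the arithexpression and unary rules only. Two simplifications are assumed: the left-recursive rule is rewritten as a loop (unary followed by any number of MULOP unary), and the expression inside the parentheses is narrowed to an arithexpression. The token array input and the function names are hypothetical.

```c
#include <stddef.h>

/* Tokens needed for this fragment; END marks the end of the input. */
enum token { IDENTIFIER, MULOP, LP, RP, END };

struct parser { const enum token *toks; size_t pos; };

static int parse_arith(struct parser *p);

/* unary : varname | LP expression RP
   (expression is narrowed to arithexpression in this sketch) */
static int parse_unary(struct parser *p)
{
    if (p->toks[p->pos] == IDENTIFIER) {
        p->pos++;
        return 1;
    }
    if (p->toks[p->pos] == LP) {
        p->pos++;
        if (!parse_arith(p) || p->toks[p->pos] != RP)
            return 0;
        p->pos++;
        return 1;
    }
    return 0;
}

/* arithexpression : unary (MULOP unary)*
   The left recursion of the grammar rule is rewritten as a loop. */
static int parse_arith(struct parser *p)
{
    if (!parse_unary(p))
        return 0;
    while (p->toks[p->pos] == MULOP) {
        p->pos++;
        if (!parse_unary(p))
            return 0;
    }
    return 1;
}

/* Returns 1 if the END-terminated token array is a valid arithexpression. */
int is_arith_expression(const enum token *toks)
{
    struct parser p = { toks, 0 };
    return parse_arith(&p) && toks[p.pos] == END;
}
```

For the source text a * (b * c), the lexical analyzer would deliver IDENTIFIER MULOP LP IDENTIFIER MULOP IDENTIFIER RP, which this recognizer accepts, while a dangling operator as in a * is rejected.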
