Languages, Compilers and Interpreters

This document is a simple introduction to compilers and associated software. (c) Dr Robert Harle, 2015.

Hand-crafting assembly code is an enjoyable task for the gifted few. But there are serious practical issues with doing so, including:

Portability. Assembly is just a direct mapping of machine code so it is inevitably architecture-specific. In fact, to eke out every last bit of performance, assembly often ends up using processor-specific operations.
Debug time. Debugging assembly code is hard work and time consuming.
Readability. It is hard to read assembly, even code you wrote yourself a while ago.
Productivity. A single line of assembly does not do very much. Therefore it takes a huge amount of programmer time to write a large application directly in assembly code.

To solve these problems we use compilers to convert higher-level, more portable code (written in other languages) to machine-specific machine code.

Compilers are just software programs. They allow us to decouple our programming from the underlying machine architecture (i.e. CPU). If you create a new CPU you just need to modify your compiler to output the new machine code that the CPU understands. For example, once you have a compiler for the C programming language, you can compile all the C code out there into machine code that will run on your CPU.

Types of Languages

Like natural (spoken) languages, programming languages are a combination of vocabulary (the keywords) and grammar (how the words must be put together to make sense). Many natural languages have similar grammar because they evolved from or next to each other (e.g. French and English are very similar). Linguists often comment that once you’re fluent in a few languages, picking up a new one is quick and easy: you just need to learn the vocabulary and figure out which of the grammar rules apply.

The same is true of computer languages. You can think of any one language as a collection of concepts (grammar rules) that someone put together because they felt it made programming for a particular task easier. In practice, this means there are a lot of languages, but it’s easy to classify them.

The main classification is into imperative and declarative languages. You’ll understand these terms in detail by the end of the first term. For now, an imperative language is one where you tell the computer both what to do and how to do it. A declarative language is one where you specify what to do, but not how. Declarative languages are considered “high-level” because they abstract away the underlying machine. An example declarative statement might be:

select * from student_database where firstname='Alice';

This is actually a piece of SQL (a database language). It instructs the database to get all entries in the student_database where the first name is Alice. But note it says nothing about how the computer should go about doing this (e.g. which search algorithm it should use). There is simply no syntax in the language to specify that level of detail. Of course, an expert has programmed the database to carry out this task, so it’s not magic: the point is that they didn’t write that part in SQL, because SQL only provides high-level concepts.
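To make the contrast concrete, here is a minimal sketch in Python that answers the same question twice: once imperatively (we spell out the search ourselves) and once by handing the SQL statement to a database. It assumes Python's built-in sqlite3 module as the database; the surname column and the sample rows are invented for illustration and are not part of the original example.

```python
import sqlite3

# Hypothetical sample data: only the table name and the firstname column
# come from the SQL statement in the text.
STUDENTS = [("Alice", "Smith"), ("Bob", "Jones"), ("Alice", "Brown")]


def find_imperative(rows, firstname):
    """Imperative: we say HOW to search (a linear scan, row by row)."""
    matches = []
    for row in rows:
        if row[0] == firstname:
            matches.append(row)
    return matches


def find_declarative(rows, firstname):
    """Declarative: we say WHAT we want; the database decides how to find it."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("create table student_database (firstname text, surname text)")
    conn.executemany("insert into student_database values (?, ?)", rows)
    result = conn.execute(
        "select * from student_database where firstname = ?", (firstname,)
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(find_imperative(STUDENTS, "Alice"))
    print(find_declarative(STUDENTS, "Alice"))
```

Both functions return the same rows, but only the first one commits us to a particular search strategy.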

If you’re feeling a bit confused, don’t worry at this stage - things will become clearer over the coming months. This year we will introduce you to both a declarative language (ML) and an imperative one (Java), and show you the various concepts they feature.

When developing software you have three main tasks:

Write the code;
Compile the code;
Debug the operation of the program (it never works properly first time!).

These are typically done iteratively: you write a chunk of code, edit it until it compiles, try it, then edit it until it does what it is meant to. Then start on the next chunk.

For many languages the compilation is explicit. For example

g++ hello_world.cpp -o hello_world

is what you might call to compile a C++ program in the hello_world.cpp file (g++ is the name of a widely-used C++ compiler). If all is well, it spits out a hello_world binary file that contains the machine code for your processor that you can run. The key thing is that your code can be run through a compiler on another platform to produce the machine code for that system.

At least, that’s the theory. In practice there are complications…

Libraries and Platforms

Let’s imagine that you make software to run on Microsoft Windows. It would be very annoying to have to repeatedly write the “nuts and bolts” code to do things like draw a window, react to a mouse click or play a video clip. Furthermore, if everyone did this, every program would look and behave differently: we would lose all consistency.

Instead we rely on software libraries. These are simply chunks of code written by someone else (who is hopefully an expert in that thing!). They are usually distributed pre-compiled for our system (i.e. as machine code) and we link our programs to them. This just means that our compiled program contains instructions that say “now run function X(), which is in library Y”. If library Y doesn’t exist on the machine, our program chokes.
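As a rough illustration of what "chokes" means, the sketch below asks the operating system to load a pre-compiled library by name, using Python's standard ctypes module. The library name is made up so that the load fails; a compiled program that depends on a missing library fails in a broadly similar way when it starts or when it first needs the library.

```python
import ctypes


def try_load(library_name):
    """Ask the operating system to load a pre-compiled library.

    Returns True if the library was found and loaded, False otherwise.
    """
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # The OS could not find or load the library, so any code that
        # relies on its functions cannot run.
        return False


if __name__ == "__main__":
    # "library_y_that_does_not_exist" is a hypothetical name chosen so that
    # the load fails on any machine.
    print(try_load("library_y_that_does_not_exist"))
```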

Now, each Operating System comes packed with libraries to do key things in a standard, uniform way (e.g. draw a window). And here’s where our portability starts to break down. A typical Windows program depends on a pile of Microsoft libraries. Those libraries simply don’t exist on, say, an Apple Mac or an Ubuntu Linux machine. So our program can’t compile for those Operating Systems. This is despite the fact that they may have the exact same CPU!

We call the combination of CPU architecture and Operating System a platform. We generally compile our code for a specific platform.
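If you want to see which platform you are on, a small sketch using Python's standard platform module reports both halves of the combination. The exact strings depend on your machine (for example an architecture such as x86_64 and a system such as Linux), so none are shown here.

```python
import platform


def describe_platform():
    """Return (CPU architecture, operating system) as reported by Python."""
    return platform.machine(), platform.system()


if __name__ == "__main__":
    cpu, os_name = describe_platform()
    print(f"CPU architecture: {cpu}, operating system: {os_name}")
```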

So, libraries are great for saving us time and effort (re)writing standard things (“boilerplate code”) that are just a distraction from our actual program. They also have another major advantage: saving space. Each library is installed once on a machine, but may be used by hundreds of programs. Without them, each of those programs would have its own copy of that code and our programs would be significantly larger.

Interpreters

Instead of compiling a program in one go ahead of time, it is possible to translate a high-level programming language into machine instructions while the program is running. Since the program under execution is not written in machine code, it needs another piece of software called an interpreter to translate the program into instructions the CPU understands.

In other words, you can think of an interpreter as a special compiler that compiles the program as it runs. The idea is you distribute your program written in a high-level language and the interpreter compiles it to machine code and sends it to the CPU. Many modern scripting languages use this approach, including JavaScript, Python and Ruby.
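The idea of handling a program one statement at a time can be sketched with a toy interpreter. The three-command language here (set, add, print) is invented purely for illustration. Note also that this sketch carries out each action directly in Python rather than emitting machine code, so it is a simplification of what real interpreters do.

```python
def interpret(source):
    """Run a toy program line by line, decoding each line just before it runs."""
    variables = {}
    output = []
    for line in source.strip().splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        command = parts[0]
        if command == "set":        # set NAME VALUE
            variables[parts[1]] = int(parts[2])
        elif command == "add":      # add NAME VALUE
            variables[parts[1]] += int(parts[2])
        elif command == "print":    # print NAME
            output.append(str(variables[parts[1]]))
        else:
            # Only discovered when execution actually reaches this line.
            raise ValueError(f"unknown command: {command}")
    return output


# A program in the made-up toy language.
PROGRAM = """
set x 5
add x 3
print x
"""

if __name__ == "__main__":
    print("\n".join(interpret(PROGRAM)))
```

Nothing is translated ahead of time: the interpreter only looks at a line when execution reaches it.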

Advantage	Disadvantage
The development cycle (see above) can be quicker as it does not have a compilation step	You have to share the program source code, which is typically larger
	More errors can occur at runtime
	Performance hit - we are simultaneously compiling and running

The point about runtime errors is worth reiterating. Say you have some code like this (written in some pseudo language):

if( x is 5 )
then print "Yes"
else pint "no"   # ERROR: Typo on word "print"

There is a glaring error in the code. A compiler needs to check every line before execution since it needs to convert the whole program to machine code. This would allow us to find the error early. However, an interpreter may only notice an error when it tries to execute the third line. So, if we don’t execute that line in testing (say we test with x=5 only), we might miss the error. Thinking our code works, we release it, only to be hit with a mass of complaints from users.
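The same trap can be reproduced in Python, which is one of the interpreted languages mentioned earlier. This is a sketch of the pseudo-code above, with the function name check chosen for illustration. The misspelt name is perfectly valid syntax, so Python only complains when the faulty line is actually executed.

```python
def check(x):
    if x == 5:
        print("Yes")
    else:
        pint("no")  # ERROR: typo on the word "print", only noticed if this line runs


if __name__ == "__main__":
    check(5)  # works: the faulty line is never reached
    try:
        check(6)  # now the else branch runs and the typo is discovered
    except NameError as error:
        print("Runtime error:", error)
```

Testing with x=5 alone would give no hint that anything is wrong.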

Java’s Hybrid Approach

If you’ve started the Java pre-course you may have noted that you must run:

javac MyProgram.java

to compile your code, then you have to run

java MyProgram

to run your compiled Java program. Further investigation will reveal that java is itself a program! In fact, javac compiles the code to an intermediate representation (known as Java bytecode) which cannot be executed directly by any CPU. The java program is an interpreter which translates the intermediate representation to the format required by the CPU at runtime. You’ll look at this in more detail in the OOP course.
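The two-step idea can be sketched with a toy example. This is only an analogy, not how javac or java work internally: the source language (sums and differences of whole numbers), the instruction names PUSH, ADD and SUB, and both function names are invented for illustration. One function plays the role of the compiler, producing an intermediate representation, and the other plays the role of the interpreter that runs it.

```python
# Maps source operators to made-up intermediate instructions.
OPERATIONS = {"+": "ADD", "-": "SUB"}


def compile_to_intermediate(source):
    """Ahead-of-time step: translate e.g. "10 + 5 - 3" into stack instructions."""
    tokens = source.split()
    instructions = [("PUSH", int(tokens[0]))]
    # The remaining tokens come in (operator, number) pairs.
    for operator, number in zip(tokens[1::2], tokens[2::2]):
        instructions.append(("PUSH", int(number)))
        instructions.append((OPERATIONS[operator],))
    return instructions


def run_intermediate(instructions):
    """Runtime step: interpret the intermediate instructions one at a time."""
    stack = []
    for instruction in instructions:
        opcode = instruction[0]
        if opcode == "PUSH":
            stack.append(instruction[1])
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "SUB":
            right, left = stack.pop(), stack.pop()
            stack.append(left - right)
        else:
            raise ValueError(f"unknown instruction: {opcode}")
    return stack.pop()


if __name__ == "__main__":
    intermediate = compile_to_intermediate("10 + 5 - 3")
    print(intermediate)
    print(run_intermediate(intermediate))
```

The intermediate instructions do not depend on any particular CPU; only the program that runs them has to be built for each platform.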

Conclusion

Understanding the various levels of “code” (machine, assembly, low level, high level) and the need for compilers or interpreters is crucial to understanding the fundamentals of programming.