Compilers 101 – Overview and Lexer – Xojo Programming Blog

At XDC 2016 there was a lot of interest in our Compiler session and LLVM. I’ve summarized a bit about LLVM in an earlier post, but to take things further, we put together this series of blog posts on compilers.

These will all be at a high-level. None of these posts are going to teach you how to write a compiler. The goal of these posts is for you to have a basic understanding of the components of a compiler and how they all work together to create a native app.

Compiler Components

A compiler is a complicated thing and consists of many components. In general the compiler is divided into two major parts: the front end and the back end. In turn, those two parts have their own components.

For the purposes of these posts, this is how we will be covering the components of the compiler:

Front End

The front end is responsible for taking the source code and converting it to a format that the back end can then use to generate binary code that can run on the target CPU architecture. The front end has these components:

Back End

The back end takes the IR, optionally optimizes it and then generates a binary (machine code) file that can be run on the target CPU architecture. These are the components of the back end:

Each of these steps processes things to get it a little further along for the next step to handle.

The linker is not technically part of the compiler but is often considered part of the compile process.

Lexer

The lexer turns source code into a stream of tokens. This term is actually a shortened version of “lexical analysis”. A token is essentially a representation of each item in the code at a simple level.

By way of example, here is a line of source code that does a simple calculation:

sum = 3.14 + 2 * 4

To see how a lexer works, let’s walk through how it would tokenize the above calculation, scanning it from left-to-right and tracking its type, value and position in the source code (which helps with more precise reporting of errors):

The first token it finds is “sum”
1. type: identifier
2. value: sum
3. start: 0
4. length: 3
Token: =
1. type: equals or assigns
2. value: n/a
3. start: 4
4. length: 1
Token: 3.14
1. type: double
2. value: 3.14
3. start: 6
4. length: 4
Token: +
1. type: plus
2. value: n/a
3. start: 11
4. length: 1
Token: 2
1. type: integer
2. value: 2
3. start: 15
4. length: 1
Token: *
1. type: multiply
2. value: n/a
3. start: 15
4. length: 1
Token: 4
1. type: integer
2. value: 4
3. start: 17
4. length: 1

As you can see, white space and comments are ignored. So after processing that single line of code there are 7 tokens that are handed off to the next part of the compiler, which is the Parser. The Parser is covered in the next post. Stay tuned!