Compiler architecture

The SC compiler parses files written in the SC language and translates it to 8051 assembly instructions. By default, it also optimizes the generated assembly instructions to minimize code size.

SC language

The SC language is a language that somewhat resembles the syntax of the C language. The exact origins of the name SC have been lost in time, but nowadays it’s considered to mean ‘Sort-of C’.

The language was created from a desire to create a compiler targeting 8051-compatible microcontrollers, while not having to implement all of the behaviour and traits of C.

In the source code, the grammar is specified using token definitions and yacc-like grammar rules. In the language reference, the grammar is specified using BNF-like syntax.

Compiler operation

Parser

First, the parser tokenizes the input. It splits the input data into tokens based on rules. Whitespace separates tokens, and certain tokens (like parentheses and operator symbols) also act as separators between other tokens. When there are multiple ways to tokenize a certain sequence, longer tokens are chosen over shorter tokens.

Currently, the SC compiler uses lex from the PLY library to tokenize the input.

After the input has been tokenized, the parser constructs a syntax tree. Currently, the SC compiler uses yacc [1] from the PLY library to parse the input tokens. Yacc is a LALR(1) parser generator which defines the parser according to grammar rules. When yacc reduces a grammar rule, compiler code related to that rule is executed, which, in the case of this compiler, creates the syntax tree element for that rule.

If at any time invalid input is encountered, parsing aborts and an appropriate error message and error location is shown.

When parsing completes, a syntax tree has been constructed for the parsed module. If the module references other modules which have not yet been parsed, they will also be parsed.

Finally, once all referenced modules have been parsed, the compiler merges the separate syntax trees.

Compiler

If a syntax tree exists for the given input modules, the modules are syntactically valid. However, expressions could refer to identifiers that don’t exist. To check for this, the compiler examines all syntax tree elements and verifies that the references exist. If they don’t, an error message is shown.

The compiler also checks for usage of defined identifiers. If a function is never called, or a variable is never read or written, a warning message is shown.

At this point, the input modules do not contain invalid code. Machine code generation can begin.

First, addresses are assigned to data fields. Internal data fields start at the first available address, after memory used for data by generated code. External data fields start at a certain offset. See the memory organization documentation for specific details.

After addresses are assigned, assembly instructions are generated. First, the compiler generates a few initialization instructions. Data fields are initialized to their specified values. Finally, the main function is called, and a return trap is added.

Next, the syntax tree is transformed into assembly instructions. This is done recursively, beginning with functions. Functions save memory used by variables, copy their arguments, execute their contents and finally restore the used memory. Expressions put their resulting values on the stack. Any arguments are taken from the stack after evaluation.

After all functions have been transformed into assembly instructions, optimization can take place. This is done by a series of transformations, described in more detail below.

Finally, the assembly instructions are converted to text format, suitable for assembler input. The SC compiler assumes the as504 assembler [2] is used.

To support the debugger (see below), the compiler creates a debug file. This contains the addresses of functions as well those of data fields. This allows the debugger to show the location relative to functions instead of merely an address. In the future, more debug data can be created.

Optimizer

The optimizer aims to reduce code size. In almost every case, smaller code is faster code. The optimizer works by repeating transformations on the assembly instructions until no further gains can be achieved. Transformations are only applied if they do not change the effects of the instructions.

Some of the applied transformations are:

  • Dead code elimination. If an instruction is not reachable, it is removed.
  • Tail call optimization. If the last instruction of a function is a call instruction, it can be replaced with a jump instruction.
  • Jump shortening. A LJMP instruction can be shortened to AJMP or SJMP if the jump target is in range. The size of a LJMP instruction is one byte more than AJMP and SJMP. This can also be applied to LCALL and ACALL.
  • Removal of useless code. Pushing a value on the stack followed by popping it to the same location is useless and can be removed.
  • Stack flattening. PUSHing a value on the stack followed by POPping it to a different location can be replaced with a single MOV instruction.
  • Instruction reordering. PUSH instructions can be moved down if the following instruction is unrelated to the address being PUSHed. Similarly, POP instructions can be moved up. This allows other transformations to have an effect on these instructions.
  • Jump flattening. If a jump instruction targets another unconditional jump instruction, the target of the first jump can be replaced with the target of the second jump.
  • Conditional jump negation. A conditional jump can be replaced with a negated version. This can result in shorter code if the following instruction is a jump instruction.

Diagnostic messages

Because the capabilities of the target hardware are limited, the SC compiler always prints some diagnostic messages. Among these messages are:

  • Program size. The 8051 can only address 65536 bytes of code memory. In the SC compiler, this is arbitrarily limited to 16384 bytes. [3]
  • Calculated maximum stack usage. The maximum stack usage must be at most the space available to the stack. If the stack overflows, it corrupts memory used for other purposes. If optimization is enabled, the true maximum stack usage is less than the value displayed. [4]
  • Available stack size. If no internal data fields exist, there are 242 bytes available for the stack. If the number of internal data fields is the maximum possible, only 128 bytes are available for the stack. Since function calls and expressions can use the stack, it is important that enough stack space is available. Otherwise, the stack could corrupt memory used for other purposes.
  • Unused identifiers. Functions are shown when they are never called, and fields are shown when they are never read or written. In the current version, local variables aren’t shown yet.

If verbose mode (-v) is enabled, the following messages are also printed:

  • Memory usage. For both internal and external memory, this shows the addresses used for variables.

If verbose progress mode (-p) is enabled, the following progress messages are also printed:

  • Parsing module foo. This is printed when a new module is about to be parsed.
  • Resolving identifiers. This is printed when all modules have been parsed and identifiers are being resolved.
  • Generating assembly. This is printed when assembly is about to be generated.
  • Optimizing assembly. This is printed when the generated assembly is about to be optimized.
  • Assembly optimized. This is printed when optimization has been completed.

If both verbose mode and verbose progress mode are enabled, each optimization iteration prints the current code size.

There are some other diagnostic options, but those are only useful when working on the compiler source code.

Debugger and uploader

The standard software to upload and debug programs to the target hardware is dScope. However, this is an old 16 bit Windows program that does not work on 64 bit Windows, nor on non-Windows systems. In addition, the user interface of dScope is not too great. Because of this, I decided to write a debugger and uploader, intended to replace dScope.

The protocol used by dScope to communicate with Mon51 (the monitor program installed on the target hardware to support the debugging and uploading features) is proprietary. While it is possible to access the flash memory and replace its contents with another program, this is not desirable because mistakes are permanent. [5] Therefore, the only option is to use the existing Mon51 program.

Mon51 itself is free to use, but its protocol is proprietary. By performing actions in dScope and monitoring serial port communication, I have partially documented its protocol. Using this documentation, I have written an interface library that supports execution control and reading and writing of memory. Using this library, I have written a command line debugger program, and a specialized uploader program.

The debugger can upload programs, show memory contents, show contents of named data fields and can, of course, step and run. The uploader merely uploads the specified program and runs it.

License

Creating a language, compiler, optimizer, uploader and debugger is fun, but they are useless if they can’t be used by others. Therefore, I have released the source code and documentation under the terms of the MIT license:

Copyright (c) 2010 Matthijs van der Vleuten

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

Footnotes

[1]PLY yacc is inspired by yacc, but it is not exactly the same. PLY yacc shares some of the syntax (and the name) of yacc, but the interface differs.
[2]as504 can be found at http://www.vanwal.nl/as504/. The asm504.com assembler is not supported.
[3]Programs larger than 16384 bytes probably need more advanced memory features, like pointers.
[4]Determining the maximum stack usage for arbitrary assembly code is a hard problem.
[5]For example, if uploading a program is interrupted, the flash memory contains some code from the original program and some of the new program. Or, a program could be uploaded which does not support uploading new programs, making that program unerasable.

Table Of Contents

Previous topic

5. Memory

Next topic

MON51 protocol

This Page