Hector Martin's Alternate DCPU-16 Proposal Would Allow for a Better C Compiler

The developer community has already made some incredibly quick progress on implementing assemblers, interpreters, and emulators for the proposed virtual computer in 0x10c, Notch's latest game. But the truth is that the majority of programmers out there couldn't be bothered with spending enormous amounts of time writing anything much more complicated than a "hello world" application in assembly. What's on the top of everybody's mind is creating a compiler for a more widely used language.

Digging into what it would take to create a C compiler for the DCPU-16 (the name of the virtual computer in 0x10c), Hector Martin found a few design shortcomings that would slow down and overcomplicate many of the most common operations used when writing code in C. To solve these issues, Martin has proposed an alternate spec for the DCPU-16 (noted below). You can read Martin's full explanation of the problems he found here.

In a nutshell, they stem from the following shortcomings.

The Shortcomings of the Currently Proposed DCPU-16 Spec

Memory is addressed word wise, with no support for 8-bit loads and stores, causing issues with addressing bytes—making pointer arithmetic overly complex and wasteful.
Only having 8 registers (A, B, C, X, Y, Z, I, J) and only supporting PUSH & POP on the stack pointer means there's no random access to the stack to be able to use it to store more registers for, say, 64-bit operations.
The operand design is very wasteful, where either instruction operand can be in memory, or a register, or even a constant, for any instruction.
Everything is unsigned, meaning all unsigned arithmetic will require a lot of overhead, and seeing how common the signed int is in C, this will effectively slow down quite a lot of code.
No "add with carry" or "subtract with carry" support means the overflow doesn't accurately work for higher than 32-bit widths without extra effort.
Instruction timings seem arbitrary and don't seem to reflect the timings of 1980s hardware.

The Solution? Load-Store Architecture (RISC style) + 16 Registers

To allow coding on the DCPU-16 to reach a wider audience by making it easier for the community to build an efficient C-compiler for the DCPU-16, Martin has proposed the following alternate spec in hopes that Notch will use at least some of it to address these issues. The following spec is re-posted here on 0x10c World and formatted for the web with permission from Hector Martin:

Hector Martin's Alternate Spec for the DCPU-16

This is a sketch of an alternate proposal for DCPU-16. While it's definitely not the ideal CPU, I do believe it is markedly better than the old proposal. This design should still allow for compact code and a decent C compiler. It also maintains some aspects of the original design. There's some room for expansion/rework (quite a few <undefined> parts).

Anyone should feel free to hack on or improve upon or take ideas from the design as they see fit. This is just a possible concept that I came up with.

General design

Load-store architecture (RISC style), 16 general registers.

Rationale:
Load-store means explicit loads and stores (duh), but 16 registers means you have reasonable working space for local variables and a decent calling convention.

Endianness

Little endian, because Notch said so (and unlike the original DCPU-16 design, which is word-based and doesn't care about endianness at the ISA level, this one does).

Unfortunately it's going to take some creativity to make a byte-addressed 16-bit machine work with the 0x10c storyline, which depends on an endian-screwup swapping the *words* in a 64-bit number, not the *bytes*... I guess he could make the CPU big-endian and the ABI wordwise little-endian, i.e. mixed-endian (this would certainly be a great excuse for the sleep cell SNAFU).

Memory

128kB, divided into two regions: low 64k and high 64k. The low 64k can be used for instructions and data, but the high 64k can only be used for instructions and rarely addressed as word-wise data.

Rationale:
This keeps the original amount of memory, while allowing for bytewise pointers. It is most likely that large programs will benefit from more code space than more data space. Allowing byte-wise loads and stores means char arrays are possible without fiddling with shifts, which helps save space.

This is still a Von Neumann architecture, as code is allowed in the low half, so self-modifying code is still entirely possible. Code/function pointers are stored as address>>1 (and thus have a different representation to data pointers, but this is entirely allowed by the C standard - and anyone doing self-modifying code knows what he's doing well enough to be able to shift things himself). There are extended instructions to read/write memory by code pointers for the purpose of loading code (but they could be used for large data tables too).

Registers

16 generally addressable registers:

R0 - R13: general purpose registers
R14: SP (supports extra addressing modes)
R15: PC
One implicit 16-bit carry register (C)

PC points to the memory word (e.g. PC=0x8000 points to memory location 0x10000); however, regular pointers point to a byte.

Rationale:
The wide carry register (O in the original design) works well for higher width shifts and also serves as the high bits for mul/div, so that can stay, although div works differently here than in the original design. It is not a general purpose reg but rather a separate register, since many instructions implicitly write it and I'd rather not doom a general register to being clobbered repeatedly.

Instruction format

Having one rigid instruction format is too limiting. However, with just a few simple formats we can cover things much more nicely while still keeping the decoder simple.

Instructions are 16 bits, sometimes followed by a 16-bit immediate value. They are always 16-bit aligned in memory.

Form A (arithmetic):
0oooooaaaabbbbbb (optionally followed by a 16-bit immediate constant)

Form B (load/store):
10ooaaaammmmmmmm (optionally followed by a 16-bit immediate constant)

Form C (branch):
110odddddddddddd

Form D (misc):
111oooooxxxxyyyy (optionally followed by a 16-bit immediate constant)

Note how the opcode is entirely determined by the top 8 bits, which means you can implement the decoder as a 256-entry lookup table or switch.

Operands:

a is a register r0-r15

b has the following modes:

0x00 - 0x0f: register r0-r15
0x10 - 0x1f:
0x10: immediate constant (next 16 bits)
- 0x11: 0xffff (-1)
-  0x12: <undefined>
-  0x13: <undefined>
-  0x14: <undefined>
-  0x15-0x1f: 1<<5 through 1<<15 (power of two constant)
0x20 - 0x3f: constant 0x00-0x1f (not sign extended)
Rationale: Note that this covers 1<<4 and lower. Subtracting small values can be done with sub.

m has the following modes:

0x00 - 0x0e: [r0..r14]
0x0f: [r14 - 2]! // predecrement by 2, i.e. push
0x10 - 0x1e: [r0..r14 + immediate offset]
0x1f: [immediate address]
0x20 - 0x2e: [r0..r14], immediate offset // postupdate
0x2f: [r14], 2 // postincrement by 2, i.e. pop
0x30 - 0xff: [r14 + 0x00..0xcf] // stack frame relative

How to tell whether there is an immediate constant following the instruction:

Form A: b == 0x10
Form B: m >= 0x10 && m <= 0x2e
Form D: o == 1

Opcodes

Notation: x:y means x concatenated with y (x is the high bits)

Form A opcodes:

00: add        c:a = a + b
01: addc c:a = a + b + c
02: sub c:a = a - b
03: subc c:a = a - b + c
04: rsb c:a = b - a
05: rsbc c:a = b - a + c
06: shl // interpret b&0x1f as a signed int, shift right if negative
if (!(b & 0x10)) c:a = a << (b & 0xf);
else a:c = a << (b & 0xf);
07: shlc // same, but shift in carry
if (!(b & 0x10)) c:a = (a << b) | c;
else a:c = (a << (b & 0xf)) | (c << 16);
08: sar        a = a >> (b & 0xf); // arithmetic shift
09: mov a = b
0a: and a &= b
0b: bcl a &= ~b
0c: or a |= b
0d: xor a ^= b
0e: mul        c:a = a * b // unsigned
0f: muls c:a = a * b // signed
10: div a = c:a / b // unsigned
11: divs a = c:a / b // signed
12: mod a %= b // unsigned
13: mods a %= b // signed
14: <undefined>
15: <undefined>
16: ifeq skip if !(a == b)
17: ifne skip if !(a != b)
18: ifgt skip if !(a > b) // signed
19: ifle skip if !(a <= b) // signed
1a: iflt skip if !(a < b) // signed
1b: ifge skip if !(a >= b) // signed
1c: ifhi skip if !(a > b) // unsigned
1d: ifls skip if !(a <= b) // unsigned
1e: iflo skip if !(a < b) // unsigned
1f: ifhs skip if !(a >= b) // unsigned

Note that if* needs to minimally decode the next instruction to determine whether it has a 16-bit operand to be skipped too (see the previous section).

Form B opcodes:

0: lw a = [m] // word
1: stw [m] = a // word
2: lb a = 0:[m] // byte
3: stb [m] = a // byte (ignores high bits)

Unaligned accesses are, of course, invalid ;)

Form C opcodes:

0: jmp pc += d // d is sign-extended from 12 bits to 16
1: jsr r14 -= 2; [r14] = pc + 1; pc += d // same as above

Form D opcodes:

00: jsr rx r14 -= 2; [r14] = pc + 1; pc = rx // indirect subroutine jump to rx
01: jsr imm r14 -= 2; [r14] = pc + 1; pc = imm // long subroutine jump to imm
02: lpw rx = [ry] // load program word (128kB word-wise addressing)
03: stpw [ry] = rx // store program word (128kB word-wise addressing)
04..1f <undefined>

Recipes

clear c (e.g. before 16/16=16 division)
shl r0, 0 // or add r0, 0, or sub...

write c (e.g. before 32/16=16 division). Note: clears register too.
shl rX, 16

read c (e.g. after multiplication)
subc rX, rX // clears c
or
shlc rX, 16 // exchanges c and rX

shift right
shl rX, -bits // note: assembler should use one-word form as (-bits)&0x1f

not
xor rX, 0xffff // fits in one word, b=0x11

long branch
mov pc, destination // can be immediate or register

pop (ret with rX=pc)
lw rX, [r14], 2 // i.e. pop rX, uses m=0xf

push
stw rX, [r14-2]!, 2 // i.e. push rX, uses m=0x1f

Instruction cycle times

Example instruction cycle times:

Base instruction time: 1 cycle
if 16-bit immediate: +1 cycle
mul: +1 cycle
div: +31 cycles
mod: +15 cycles
jumps (anything dest=pc, jsr, jmp, ret): +1 cycle
loads/stores (incl. push, pop): +1 cycle
conditional instructions: if condition is false (skip), +1 cycle

For example, mov pc, 0x1234 takes 3 cycles, but jmp 0x1234 (when 0x1234 is within relative addressing range) takes only 2. jsr 0x1234 takes 4 cycles when out of relative range (Form D 1), but 3 when in range (Form C 1). ret takes 3 cycles. And something like div pc, 0x1234 (which is insane in and of itself) would take 34 cycles.

License

I hereby place this document in the public domain.
-- Hector Martin "marcan"

Just updated your iPhone? You'll find new emoji, enhanced security, podcast transcripts, Apple Cash virtual numbers, and other useful features. There are even new additions hidden within Safari. Find out what's new and changed on your iPhone with the iOS 17.4 update.