head 1.1; branch 1.1.1; access ; symbols start':1.1.1.1 cd16:1.1.1; locks ; strict; comment @# @; 1.1 date 2003.08.15.17.26.03; author beckert; state Exp; branches 1.1.1.1; next ; 1.1.1.1 date 2003.08.15.17.26.03; author beckert; state Exp; branches ; next ; desc @@ 1.1 log @Initial revision @ text @ CD16 soft CPU core

CD16 Soft CPU

A synthesizable 16-bit CPU with development tools.

The CD16 is a cross between a stack CPU and a register CPU. It's a strange critter, like the Nick cartoon character "Catdog". It's also a 16-bit chip. Hence the name CD16. It was designed to fit the back ends of Forth and C compilers, so the architecture lends itself to efficient execution of both Forth and C.

The basic CD16 is free, as are the tools that go with it. Here's what you get:

Documents and downloads        Last Modified
CD16 User's Manual             2003.07.25
CD16 Programmer's Manual       2003.07.25
CD16 Core basic datapaths      2003.02.14
Installation guide             2003.08.03
Win32forth v4.2 self-install   1998
CD16 ver 1.1 archive           2003.08.03
Revision history               "

A typical SoC implementation uses an FPGA's block RAMs for the memories. Data and program memories are synchronous read and synchronous write. The program space has a bank register to allow addressing of large data. The program itself is limited to the lower 64K.

The stack RAM (which also stores a register bank and interrupt vectors) is asynchronous read, synchronous write. In most FPGAs, the block RAM doesn't support asynchronous read, so this means clocking the RAM at twice the CPU clock frequency and starting a read halfway through the CPU cycle.

The coprocessor is user-defined. A stub is supplied in the archive. You can use it to implement application-specific instructions.


The CD16 was designed with the following goals:

The CD16 performs most operations in one clock cycle. It has a shallow pipeline, so branches and calls aren't too expensive. It has a relatively rich instruction set. Because it is not heavily pipelined, it's not especially fast compared to RISCs. However, it has good coprocessor support, so you can add application-specific instructions. With your own application-specific hardware added, the CD16 can often trump hard CPUs that run many times faster.

The CD16 is written in combined VHDL and Forth. Both languages are used to express an RTL model of the CPU. Take a look at the source code. Everything to the left of the "--" delimiter is VHDL while everything to the right is Forth. The Forth system is modified to ignore everything to the left of the "--". So, hardware simulation can be done by either a VHDL simulator or a Forth system. The software simulator (written in Win32forth) loads this file to provide cycle-accurate simulation of both documented and undocumented instructions. It also simulates periodic interrupts.
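
To illustrate the dual-language layout described above, here is a minimal Python sketch of how one source line divides at the "--" delimiter. The sample line and the function name split_dual_line are hypothetical and are not taken from the CD16 archive. The only fact relied on is that VHDL treats "--" as the start of a comment, so a VHDL tool sees only the left side while the modified Forth system reads only the right side.

```python
def split_dual_line(line: str) -> tuple[str, str]:
    """Split one source line at the first "--" into (vhdl_part, forth_part)."""
    vhdl, _sep, forth = line.partition("--")
    return vhdl.rstrip(), forth.strip()


# Hypothetical line, for illustration only (not from the CD16 source).
sample = "T <= T + N;   -- T N + to T"
vhdl_part, forth_part = split_dual_line(sample)
print("VHDL sees: ", vhdl_part)
print("Forth sees:", forth_part)
```

A line with no "--" yields an empty Forth part, which matches the idea that such a line is VHDL only.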

Benchmarks

How big is it and how fast?

On a Xilinx Spartan2, expect 20 MHz and 350 LUTs. That's most of an XC2S30, which is $10 these days. Or, half of an XC3S50 at 40 MHz. The XC3S50 should be well under $10 in 2004. Of course, most of your program memory would probably be off-chip. Xilinx supplies free synthesis and simulation tools for their chips, as do many other FPGA vendors. The following table shows benchmark results produced by the compiler included in the archive. The routines could be hand-optimized to get more speed, but that's not a very fair comparison.

The code size is a little bloated for a Forth chip, but compiler settings are available to trade speed for code size. To implement compact Forth code, all you really need in an ISA are compact calls, branches and literals. When speed is needed, you use whatever else is available.

There are two big reasons I didn't design a pure Forth chip. They are:

Benchmark                  Bytes   Clocks
Sieve of Eratosthenes      110     435K
Fibonacci                  34      2121K
9-digit string->number     136     1600
9-digit number->string     172     1240
Quicksort                  540     339K

How does it compare to other CPUs?

Pretty much what you'd expect from a typical CISC with 1 or 2 cycle instructions. Multiply and divide operations use steps, so 16x16 multiply takes 16 clocks plus setup overhead. Division takes 32 clocks plus setup overhead.
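
As a rough illustration of why a stepped 16x16 multiply costs 16 clocks, here is a generic shift-and-add multiply in Python, with one loop iteration standing in for one step. This is a textbook algorithm shown under the assumption that each step handles one multiplier bit. It is not the CD16's actual step instruction, and the name multiply_by_steps is made up for this sketch.

```python
def multiply_by_steps(a: int, b: int) -> tuple[int, int]:
    """Unsigned 16x16 -> 32-bit shift-and-add multiply.

    Returns (product, steps); each loop iteration models one multiply step.
    """
    if not (0 <= a < 1 << 16 and 0 <= b < 1 << 16):
        raise ValueError("operands must be unsigned 16-bit values")
    product = 0
    steps = 0
    for bit in range(16):
        # Add the shifted multiplicand when this multiplier bit is set.
        if (b >> bit) & 1:
            product += a << bit
        steps += 1
    return product, steps


product, steps = multiply_by_steps(1234, 5678)
print(product == 1234 * 5678, steps)
```

The step count is fixed at 16 regardless of the operand values, which is why the cost is quoted as a constant plus setup overhead.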

The MCF5307 benchmarks are from Forth Inc's web site and are a couple of years old. Their 68K optimizations are pretty good. Note that the MCF5307 is slowed somewhat by pipeline stalls so mileage is a little worse on simple benchmarks like the sieve.

Sieve Benchmark          Compiler   Bytes   Time
CD16 @@ 25 MHz           Raptor     110     17.5 ms
B16 @@ 50 MHz            hand       <100    11.7 ms
MCF5307 @@ 45 MHz        SwiftX     114     16.0 ms
Pentium II @@ 300 MHz    VFX        >200    1.0 ms
Comments: The sieve benchmark likes fast data memory access and doesn't do much math.

Fibonacci Benchmark      Compiler   Bytes   Time
CD16 @@ 25 MHz           Raptor     34      85 ms
MCF5307 @@ 45 MHz        SwiftX     28      66 ms
Pentium II @@ 300 MHz    VFX        47      4.4 ms
Comments: Call and return are cheap on the CD16.

Quicksort Benchmark      Compiler   Bytes   Time
CD16 @@ 25 MHz           Raptor     596     13 ms
MCF5307 @@ 45 MHz        SwiftX     540     6 ms
Comments: Coldfire rocks. Looks like the CD16 code could use some hand tuning.
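
Because the CPUs above run at different clock rates, it can help to convert each time back into clock cycles. The short Python sketch below does that for the sieve figures. The function name cycles is invented for this example, and the inputs are just the times and clock rates quoted in the sieve table.

```python
def cycles(time_ms: float, clock_mhz: float) -> float:
    """Clock cycles consumed: 1 ms at 1 MHz is 1000 cycles."""
    return time_ms * clock_mhz * 1e3


# (time in ms, clock in MHz) as quoted in the sieve table.
sieve = {
    "CD16": (17.5, 25),
    "B16": (11.7, 50),
    "MCF5307": (16.0, 45),
    "Pentium II": (1.0, 300),
}

for cpu, (ms, mhz) in sieve.items():
    print(f"{cpu}: {cycles(ms, mhz) / 1e3:.1f}K cycles")
```

For the CD16, 17.5 ms at 25 MHz works out to 437.5K cycles, in line with the 435K clocks listed in the benchmark table.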

How long does behavioral simulation take for the CD16 system model?

On a 1.8 GHz Pentium 4, simulation time for 25 million cycles (1 second real time) is:

Simulation tool   Seconds
Win32forth        401
VFX               47
ModelSim          500

The generic ANS Forth simulator is pretty bare bones since there's no graphics standard for Forth. But VFX runs a true hardware simulation about ten times the speed of a VHDL simulation tool.
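
The "about ten times" figure can be checked directly from the timings above. This Python sketch only uses the 25 million cycle count and the seconds quoted in the table; the helper name cycles_per_second is made up for the example.

```python
SIMULATED_CYCLES = 25_000_000  # one second of real time at 25 MHz

# Wall-clock seconds per tool, as quoted in the table.
seconds = {"Win32forth": 401, "VFX": 47, "ModelSim": 500}


def cycles_per_second(wall_seconds: float) -> float:
    """Simulated CPU cycles per second of wall-clock time."""
    return SIMULATED_CYCLES / wall_seconds


for tool, secs in seconds.items():
    rate = cycles_per_second(secs)
    print(f"{tool}: {rate:,.0f} cycles/s ({secs}x slower than real time)")

speedup = seconds["ModelSim"] / seconds["VFX"]
print(f"VFX vs ModelSim speedup: {speedup:.1f}x")
```

Since one second of real time was simulated, the seconds column is also the slowdown factor relative to real hardware.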

Support

Future CD16 archives will have multitasking, locals support, floating point, and a bigger test suite.

You want a C compiler? Retarget GCC or LCC. I'll be using Forth, which is generally more productive, compact and fun.

Trying to climb the FPGA learning curve? Try these links:
http://tutor.al-williams.com
http://www.fpga4fun.com

Suggestions and bug reports: brad@@tinyboot.com

Do you find the CD16 and tools useful? Thank my wife for putting up with the project.

@ 1.1.1.1 log @Imported sources @ text @@