head 1.1; branch 1.1.1; access ; symbols start':1.1.1.1 cd16:1.1.1; locks ; strict; comment @# @; 1.1 date 2003.08.15.17.26.03; author beckert; state Exp; branches 1.1.1.1; next ; 1.1.1.1 date 2003.08.15.17.26.03; author beckert; state Exp; branches ; next ; desc @@ 1.1 log @Initial revision @ text @
A synthesizable 16-bit CPU with development tools.
The CD16 is a cross between a stack CPU and a register CPU. It's a strange critter, like the Nick cartoon character "Catdog". It's also a 16-bit chip. Hence the name CD16. It was designed to fit the back ends of Forth and C compilers, so the architecture lends itself to efficient execution of both Forth and C.
The basic CD16 is free, as are the tools that go with it. Here's what you get:
Documents and downloads | Last Modified |
CD16 User's Manual | 2003.07.25 |
CD16 Programmer's Manual | 2003.07.25 |
CD16 Core basic datapaths | 2003.02.14 |
Installation guide | 2003.08.03 |
Win32forth v4.2 self-install | 1998 |
CD16 ver 1.1 archive | 2003.08.03 |
Revision history | 2003.08.03 |
A typical SoC implementation uses an FPGA's block RAMs for the memories.
Data and program memories are synchronous read and synchronous write. The
program space has a bank register to allow addressing of large data. The
program itself is limited to the lower 64K.
The stack RAM (which also stores a register bank and interrupt vectors) is asynchronous read, synchronous write. In most FPGAs the block RAM doesn't support asynchronous read, so the stack RAM has to be clocked at twice the CPU clock frequency, with each read started halfway through the cycle.
The coprocessor is user-defined. A stub is supplied in the archive; you can use it to implement application-specific instructions.
The CD16 was designed with the following goals:
The CD16 performs most operations in one clock cycle. It has a shallow pipeline, so branches and calls aren't too expensive. It has a relatively rich instruction set. Because it isn't heavily pipelined, it's not especially fast compared to RISCs. However, it has good coprocessor support, so you can add application-specific instructions. With your own application-specific hardware added, the CD16 can often trump hard CPUs that run many times faster than the CD16.
The CD16 is written in combined VHDL and Forth. Both languages are used to express an RTL model of the CPU. Take a look at the source code. Everything to the left of the "--" delimiter is VHDL while everything to the right is Forth. The Forth system is modified to ignore everything to the left of the "--". So, hardware simulation can be done by either a VHDL simulator or a Forth system. The software simulator (written in Win32forth) loads this file to provide cycle-accurate simulation of both documented and undocumented instructions. It also simulates periodic interrupts.
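To make the dual-language layout concrete, here is a minimal Python sketch that splits a source line at the first "--" into its VHDL half and its Forth half. The function name and the sample lines are made up for illustration and are not taken from the CD16 source; the real Forth loader may treat edge cases differently.

```python
def split_dual_line(line: str, delimiter: str = "--") -> tuple[str, str]:
    """Split one source line into (vhdl_part, forth_part) at the first delimiter.

    Lines without the delimiter are treated as VHDL only.
    """
    vhdl, sep, forth = line.partition(delimiter)
    if not sep:
        return line.rstrip(), ""
    return vhdl.rstrip(), forth.strip()


# Hypothetical sample lines, for illustration only.
sample = [
    "t <= t + n;   -- t n + to t",
    "no_forth_on_this_line;",
]

for text in sample:
    vhdl_part, forth_part = split_dual_line(text)
    print(f"VHDL: {vhdl_part!r}   Forth: {forth_part!r}")
```

A VHDL tool sees the right-hand side as an ordinary comment, which is why the same file can feed both kinds of simulator.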
How big is it and how fast?
On a Xilinx Spartan2, expect 20 MHz and 350 LUTs. That's most of an XC2S30, which is $10 these days. Or, half of an XC3S50 at 40 MHz. The XC3S50 should be well under $10 in 2004. Of course, most of your program memory would probably be off-chip. Xilinx supplies free synthesis and simulation tools for their chips, as do many other FPGA vendors.
The following table shows benchmark results produced by the compiler included in the archive. The routines could be hand-optimized to get more speed, but that wouldn't be a very fair comparison.
The code size is a little bloated for a Forth chip, but compiler settings are available to trade speed for code size. To implement compact Forth code, all you really need in an ISA are compact calls, branches and literals. When speed is needed, you use whatever else is available.
There are two big reasons I didn't design a pure Forth chip. They are:
Benchmark | Bytes | Clocks |
Sieve of Eratosthenes | 110 | 435K |
Fibonacci | 34 | 2121K |
9-digit string->number | 136 | 1600 |
9-digit number->string | 172 | 1240 |
Quicksort | 540 | 339K |
How does it compare to other CPUs?
Pretty much what you'd expect from a typical CISC with 1 or 2 cycle instructions. Multiply and divide operations use steps, so 16x16 multiply takes 16 clocks plus setup overhead. Division takes 32 clocks plus setup overhead.
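For readers unfamiliar with step-based multiplication, the sketch below shows the generic textbook shift-and-add method in Python, with one loop iteration standing in for one step. It is an illustration of why a 16x16 multiply needs 16 steps, not a model of the CD16's actual step instruction, and it does not account for the setup overhead.

```python
def multiply_steps(a: int, b: int, width: int = 16) -> tuple[int, int]:
    """Unsigned shift-and-add multiply.

    Returns (product, steps), where steps counts one iteration per
    multiplier bit, i.e. `width` iterations in total.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    steps = 0
    for bit in range(width):
        # Add the shifted multiplicand when this multiplier bit is set.
        if (b >> bit) & 1:
            product += a << bit
        steps += 1
    return product, steps


product, steps = multiply_steps(1234, 5678)
print(product, steps)
```

The step count depends only on the operand width, never on the operand values, which is what makes the cost predictable.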
The MCF5307 benchmarks are from Forth Inc's web site and are a couple of years old. Their 68K optimizations are pretty good. Note that the MCF5307 is slowed somewhat by pipeline stalls so mileage is a little worse on simple benchmarks like the sieve.
Sieve Benchmark | Compiler | Bytes | Time |
CD16 @@ 25 MHz | Raptor | 110 | 17.5 ms |
B16 @@ 50 MHz | hand | <100 | 11.7 ms |
MCF5307 @@ 45 MHz | SwiftX | 114 | 16.0 ms |
Pentium II @@ 300 MHz | VFX | >200 | 1.0 ms |
Comments: The sieve benchmark likes fast data memory access and doesn't do much math.
Fibonacci Benchmark | Compiler | Bytes | Time |
CD16 @@ 25 MHz | Raptor | 34 | 85 ms |
MCF5307 @@ 45 MHz | SwiftX | 28 | 66 ms |
Pentium II @@ 300 MHz | VFX | 47 | 4.4 ms |
Comments: Call and return are cheap on the CD16.
Quicksort Benchmark | Compiler | Bytes | Time |
CD16 @@ 25 MHz | Raptor | 596 | 13 ms |
MCF5307 @@ 45 MHz | SwiftX | 540 | 6 ms |
Comments: Coldfire rocks. Looks like the CD16 code could use some hand tuning.
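The CD16 times in these comparisons can be cross-checked against the clock counts in the earlier benchmark table by dividing by the 25 MHz clock. The Python sketch below does that arithmetic; it assumes "K" in the clock counts means 1000, and the dictionary and function names are illustrative only.

```python
CLOCK_HZ = 25_000_000  # CD16 clock rate used in the comparison tables


def clocks_to_ms(clocks: int, clock_hz: int = CLOCK_HZ) -> float:
    """Convert a clock count to milliseconds at the given clock rate."""
    return clocks * 1000 / clock_hz


# Clock counts from the benchmark table, assuming K = 1000.
clock_counts = {
    "Sieve": 435_000,
    "Fibonacci": 2_121_000,
    "Quicksort": 339_000,
}

for name, clocks in clock_counts.items():
    print(f"{name}: {clocks_to_ms(clocks):.1f} ms")
```

The results come out close to the 17.5 ms, 85 ms and 13 ms listed; the small differences are consistent with the clock counts having been rounded to the nearest thousand.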
How long does behavioral simulation take for the CD16 system model?
On a 1.8 GHz Pentium 4, simulation time for 25 million cycles (1 second real time) is:
Simulation tool | Seconds |
Win32forth | 401 |
VFX | 47 |
ModelSim | 500 |
The generic ANS Forth simulator is pretty bare bones since there's no graphics standard for Forth. But VFX runs a true hardware simulation about ten times the speed of a VHDL simulation tool.
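The "about ten times" figure follows directly from the table. As a quick check, this Python sketch turns the wall-clock seconds into simulated cycles per second and into a speed ratio relative to ModelSim; the variable and function names are illustrative.

```python
SIMULATED_CYCLES = 25_000_000  # 1 second of real time at 25 MHz

# Wall-clock seconds from the simulation table.
wall_seconds = {"Win32forth": 401, "VFX": 47, "ModelSim": 500}


def cycles_per_second(seconds: float) -> float:
    """Simulated CPU cycles completed per wall-clock second."""
    return SIMULATED_CYCLES / seconds


def speed_ratio(tool: str, baseline: str = "ModelSim") -> float:
    """How many times faster `tool` is than `baseline`."""
    return wall_seconds[baseline] / wall_seconds[tool]


for tool, seconds in wall_seconds.items():
    print(f"{tool}: {cycles_per_second(seconds):,.0f} cycles/s, "
          f"{speed_ratio(tool):.1f}x ModelSim")
```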
Future CD16 archives will have multitasking, locals support, floating point and a bigger test suite.
You want a C compiler? Retarget GCC or LCC. I'll be using Forth, which is generally more productive, compact and fun.
Trying to climb the FPGA learning curve? Try these links:
http://tutor.al-williams.com
http://www.fpga4fun.com
Suggestions and bug reports: brad@@tinyboot.com
Do you find the CD16 and tools useful? Thank my wife for putting up with the project.
@ 1.1.1.1 log @Imported sources @ text @@