head 1.1; branch 1.1.1; access ; symbols start':1.1.1.1 cd16:1.1.1; locks ; strict; comment @# @; 1.1 date 2003.08.15.17.26.02; author beckert; state Exp; branches 1.1.1.1; next ; 1.1.1.1 date 2003.08.15.17.26.02; author beckert; state Exp; branches ; next ; desc @@ 1.1 log @Initial revision @ text @
Stack machines (Forth chips) usually fall into two categories. The first category is the Novix NC4016 and descendants. The NC4016 uses an unencoded instruction format. Various instruction groups are formatted in independent fields of bits that simultaneously control different parts of the machine, much like horizontal microcode. So while the instructions are fairly simple, several of them may be performed per cycle.
The second category is Chuck Moore's MISC (Minimal Instruction Set Computer) architecture. MISC packs several 5-bit instructions into a program word. Chuck Moore has special design tools for asynchronous CPU design, leading to MISC CPUs that are very fast. The rest of us are stuck with synchronous design tools. There are several FPGA-based CPU designs based on Chuck's work. Instructions are decoded one clock cycle at a time. With three or more instructions per program word, MISC alleviates the memory bottleneck. But in a SOC, program memory is on-chip. Without the bottleneck, it might be better to take advantage of the memory's bandwidth with more complex instructions.
The drawback with traditional stack machines is that instructions are wasted shuffling around parameters. Forth primitives operate on top of the stack to manipulate implicit parameters. But, you have to look at what these instruction sequences are doing. In an algorithm with more than a few parameters, a lot of shuffling is going on. The expense of this shuffling is hard to quantify, being dependent on the application. The CD16 is a general purpose CPU intended for moderate number crunching, so we assume that there will be a lot of parameter shuffling.
The CD16 has 4-bit stack address fields that allow operations to be performed on the stack without the prerequisite shuffling. An optimizer folds shuffling activity into the instructions used to do useful work. Parameters are divided into two groups: Stack and Global. Stack parameters address the top eight elements of the data or return stack. Global parameters address eight global registers.
The CD16 uses a semi-encoded instruction format. With the parameter address bits, it doesn't have very many bits left for the unencoded approach used by the NC4016. The ISA is divided into nine groups, each with its own format. The instruction set is a collection of useful instructions that fall out of the implementation. Other instructions are possible but not documented. For these, you can refer to the CPU's HDL code. The simulator is based on the hardware RTL model, so you are free to use undocumented instructions if you find some that are useful.
There is good support for Forth locals, which are stored in a frame on the return stack. C stack frames can be set up on the return stack, allowing a C compiler to generate reasonable code for the CD16.
This is a guide to the CD16 instruction set. In case of ambiguity, the CPU's RTL model is the definitive specification.
These mnemonics are used by the assembler. The disassembler may display in either mnemonic or algebraic format. Algebraic format will usually be used, so you don't have the extra step of looking up cryptic mnemonics in the manual.
Operands: | Range: |
Sa = A operand: 0..7 = top of stack, 8..F = global register 0..7. | S0 ... S7, G0 ... G7 |
Sb = B operand: 0..7 = top of stack, 8..F = global register 0..7. | S0 ... S7, G0 ... G7 |
Ra = A operand: 0..7 = top of return stack, 8..F = global 0..7. | R0 ... R7, G0 ... G7 |
K6 = 6-bit signed literal | -32 ... 31 |
Ria = 6-bit index into the return stack | S0 ... S31 |
Sia = 6-bit index into the data stack | R0 ... R31 |
Condition codes for Scc and SKcc are encoded using bits 11:8.
cccc= | extension | True when | Comparison |
0000 | N | Never | |
0001 | A | Always | |
0010 | LS | W=0 or C=0 | unsigned <= |
0011 | HI | W<>0 and C=1 | unsigned > |
0100 | CS | C = 1 | unsigned >= |
0101 | CC | C = 0 | unsigned < |
0110 | EQ | W = 0 | = |
0111 | NE | W <> 0 | <> |
1000 | VS | V = 1 | |
1001 | VC | V = 0 | |
1010 | MI | W < 0 | |
1011 | PL | W >= 0 | |
1100 | LT | W>0 xor V=1 | signed < |
1101 | GE | W<0 xor V=0 | signed >= |
1110 | LE | W<=0 xor V=1 | signed <= |
1111 | GT | W<=0 xor V=0 | signed > |
Name | Operands | Encoding | |
ADD | sa sb sa|w | 0100 c010 aaaa bbbb | |
W = S[a] + S[b], copy to S[a] if c=1. Save carry in CF | |||
ADDC | sa sb sa|w | 0100 c011 aaaa bbbb | |
W = S[a] + S[b] + CF, copy to S[a] if c=1. Save carry in CF | |||
ADDNC | sa sb sa|w | 0100 c000 aaaa bbbb | |
W = S[a] + S[b], copy to S[a] if c=1. Doesn't modify CF. | |||
k | ADDRP | 0000 10kk kkkk 1100 | |
Add 6-bit signed k to RP. | |||
k | ADDSP | 0000 10kk kkkk 0100 | |
Add 6-bit signed k to SP. | |||
AND | sa sb sa|w | 0100 c101 aaaa bbbb | |
W = S[a] AND S[b], copy to S[a] if c=1. | |||
ASR | sb sa | 0101 1111 aaaa bbbb | |
W = S[a] = S[b] / 2. CF = carry. | |||
n | BANK | 0111 0001 01nn nnnn | |
Select the bank to use for extended program memory access. | |||
dest | BR | 0001 dddd dddd dddd | |
Relative branch. The destination of the branch is P+d where d is a signed 12-bit displacement. | |||
n | COP | sb | 0010 nnnn nn00 bbbb |
Coprocessor S[b], instruction n. | |||
n | COPI | sb | 0010 nnnn nn10 bbbb |
Coprocessor S[b], instruction n. Increment SP. | |||
n | COPW | sb | 0010 nnnn nn10 bbbb |
Coprocessor S[b], instruction n. Write result to S[b]. | |||
n | COPWI | sb | 0010 nnnn nn11 bbbb |
Coprocessor S[b], instruction n. Write result to S[b]. Increment SP. | |||
CALL | dest | 1ddd dddd dddd dddd | |
Absolute subroutine call to 2*d. Covers 64K program space. | |||
CRC | sa sb | 0111 0111 aaaa bbbb | |
CRC step. Copies result to W. If W[n]=1 then S[b] = S[b] * 2 ^ S[a] else S[b] = S[b] * 2. | |||
DEC | sb sa | 0101 0011 aaaa bbbb | |
S[a] = S[b] - 1. | |||
DIV1 | sb sa | 0111 0100 aaaa bbbb | |
Division step 1. If carry(S[a]+S[b])=1 then S[a] = S[a] + S[b]. Latch carry in CF. | |||
DIV2 | sb | 0111 0101 0000 bbbb | |
Division step 2. W:S[b] = W:S[b] * 2 + CF. | |||
DROP | 0000 0000 0010 0000 | ||
Pop and discard top of data stack. | |||
EXEC | 0000 0100 0000 1111 | ||
Call the word at the address on top of the return stack. | |||
IDLE | 0000 0000 0000 1000 | ||
Suspend CPU operation until an interrupt request. | |||
IJMP | 0000 0100 0000 0001 | ||
Indirect jump. Pops jump address off the data stack. | |||
IJMPD | 0000 0000 0000 0001 | ||
Indirect jump. Pops jump address off the data stack. The instruction immediately following IJMPD will be executed regardless. | |||
INC | sb sa | 0101 0001 aaaa bbbb | |
W = S[a] = S[b] + 1. | |||
INC2 | sb sa | 0101 0010 aaaa bbbb | |
W = S[a] = S[b] + 2. | |||
JMP | ra | 0000 0100 aaaa 0111 | |
Jump: Swap P, S[a]. | |||
LDD | sb sa | 0110 0100 aaaa bbbb | |
S[a] = pipe. Pipe = DataMem[S[b]]. | |||
LDDD | sb sa | 0110 0111 aaaa bbbb | |
S[a] = pipe. Pipe = DataMem[S[b]]. S[b] = S[b] - 1. | |||
LDDI | sb sa | 0110 0101 aaaa bbbb | |
S[a] = pipe. Pipe = DataMem[S[b]]. S[b] = S[b] + 1. | |||
n | LDDM | sb | 0000 0110 0000 0110 nnnn nnnn nnnn nnnn |
S[a] = DataMem[n]. | |||
LDDS | sb | 0110 0110 0000 bbbb | |
Start data read: Pipe = DataMem[S[b]]. | |||
LDP | sb sa | 0110 0000 aaaa bbbb | |
S[a] = ProgMem[S[b]]. | |||
LDPD | sb sa | 0110 0011 aaaa bbbb | |
S[a] = ProgMem[S[b]]. S[b] = S[b] - 1. | |||
LDPI | sb sa | 0110 0001 aaaa bbbb | |
S[a] = ProgMem[S[b]]. S[b] = S[b] + 1. | |||
LDPP | sb | 0110 0010 0000 bbbb | |
Push ProgMem[S[b]] onto data stack. | |||
LDRP | ra | 0000 0000 aaaa 1010 | |
RP = R[a]. | |||
LDSP | sa | 0000 0000 aaaa 0010 | |
SP = S[a]. | |||
LDW | sb | 0111 1000 0000 bbbb | |
W = S[b], clear CF | |||
n | LIT | 0000 1100 0000 0110 nnnn nnnn nnnn nnnn | |
Push immediate data n to the data stack. | |||
n | LITM | 0000 1110 0000 0110 nnnn nnnn nnnn nnnn | |
Push dmem[n] to the data stack. | |||
LSL | sb sa | 0101 1000 aaaa bbbb | |
W = S[a] = S[b] << 1. | |||
LSR | sb sa | 0101 1100 aaaa bbbb | |
W = S[a] = S[b] >> 1. | |||
MASK | sb sa | 0111 0010 aaaa bbbb | |
S[a] = S[a] AND S[b] | |||
MASKH | sb sa | 0111 0011 aaaa bbbb | |
S[a] = ByteSwap(S[a]) AND S[b] | |||
n | MOVL | sa | 0000 0000 aaaa 0110 nnnn nnnn nnnn nnnn |
Store immediate data n16 to W and S[a]. Two-cycle, two-cell instruction. | |||
n | MOVLR | ria | 0000 10aa aaaa 1110 nnnn nnnn nnnn nnnn |
Store immediate data to R[a]. | |||
n | MOVLS | sia | 0000 10aa aaaa 0110 nnnn nnnn nnnn nnnn |
Store immediate data to S[a]. | |||
MOVRS | ria sb | 0011 01aa aaaa bbbb | |
Move R[a] to S[b]. | |||
MOVSR | sb sia | 0011 00aa aaaa bbbb | |
Move S[b] to R[a]. | |||
MOVS | sa sb sa|w | 0100 c100 aaaa bbbb | |
W = S[b], copy to S[a] if c=1. | |||
MOVSS | sb sa | 0101 0000 aaaa bbbb | |
W = S[a] = S[b] | |||
MOVW | sb | 0100 0100 0000 bbbb | |
W = S[b] | |||
MOVWR | ra | 0000 0000 aaaa 1101 | |
Store W to R[a]. | |||
MOVWS | sa | 0000 0000 aaaa 0101 | |
Store W to S[a]. | |||
MUL | sa sb | 0111 0110 aaaa bbbb | |
Multiplication step. If W[n]=1 then W:S[b] = W:S[b] * 2 + S[a] else W:S[b] = W:S[b] * 2 | |||
MULC0 | sb | 0111 1001 0000 bbbb | |
W = S[b]/2, clear CF ( begin multiplication ) | |||
MULC1 | sb | 0111 1010 0000 bbbb | |
W = (W+CF)*2 + B[n], clear CF ( end multiplication ) | |||
NEG | sb sa | 0101 0101 aaaa bbbb | |
W = S[a] = -S[b]. | |||
NOP | 0000 0000 0000 0000 | ||
No operation. | |||
NOT | sb sa | 0101 0100 aaaa bbbb | |
W = S[a] = !S[b]. | |||
OR | sa sb sa|w | 0100 c110 aaaa bbbb | |
W = S[a] OR S[b], copy result to S[a] if c=1. | |||
POPS | sb | 0011 1100 0000 bbbb | |
Pop S[b] from return stack. | |||
n | PUSHI | 0000 1100 0000 1110 nnnn nnnn nnnn nnnn | |
Push immediate data to the return stack. | |||
n | PUSHM | 0000 1110 0000 1110 nnnn nnnn nnnn nnnn | |
Push dmem[n] to the return stack. | |||
PUSHS | sb | 0011 1000 0000 bbbb | |
Push S[b] to return stack. | |||
PUSHW | 0000 0100 0000 1101 | ||
Push W to the return stack | |||
n-1 | REP | 0111 0001 000n nnnn | |
Repeat next instruction n+1 times. Good for 1..32 executions. | |||
REPW | 0111 0001 0001 0000 | ||
Repeat next instruction W+1 times. Good for 1..32 executions. | |||
RET | |||
Return from subroutine. | |||
RETD | 0000 0000 0000 1001 | ||
Return from subroutine, delayed. The instruction immediately following RETD will be executed. | |||
ROL | sb sa | 0101 1011 aaaa bbbb | |
W = S[a] = (S[b] << 1) + MSB(S[b]). | |||
ROLC | sb sa | 0101 1001 aaaa bbbb | |
W = S[a] = (S[b] << 1) + CF. CF = carry. | |||
RORC | sb sa | 0101 1101 aaaa bbbb | |
W = S[a] = (CF << 15) + (S[b] >> 1). CF = carry. | |||
Scc | sa/ra | 0000 cccc aaaa r011 | |
R=0: Store all '1's to S[a] if condition is true, else store all '0's. | |||
R=1: Store all '1's to R[a] if condition is true, else store all '0's. | |||
SKcc | 0000 cccc 0000 0000 | ||
Skip the next instruction if condition is true. Next instruction can't be long. | |||
SKDcc | 0000 cccc 0010 0000 | ||
Skip the next instruction if condition is true. Also perform DROP operation. | |||
STD | sa sb | 0110 1100 aaaa bbbb | |
DataMem[S[b]] = S[a]. | |||
STDD | sa sb | 0110 1111 aaaa bbbb | |
DataMem[S[b]] = S[a], S[b] = S[b] - 1. | |||
STDI | sa sb | 0110 1101 aaaa bbbb | |
DataMem[S[b]] = S[a], S[b] = S[b] + 1. | |||
n | STDM | 0000 0111 0000 0110 nnnn nnnn nnnn nnnn | |
DataMem[n] = S[a]. | |||
STP | sa sb | 0110 1000 aaaa bbbb | |
ProgMem[S[b]] = S[a]. | |||
STPD | sa sb | 0110 1011 aaaa bbbb | |
ProgMem[S[b]] = S[a], S[b] = S[b] - 1. | |||
STPI | sa sb | 0110 1001 aaaa bbbb | |
ProgMem[S[b]] = S[a], S[b] = S[b] + 1. | |||
STSP | sa/ra | 0000 0010 aaaa r101 | |
R=0: S[a] = SP. | |||
R=1: R[a] = RP. | |||
SUB | sa sb sa|w | 0100 c001 aaaa bbbb | |
W = S[a] - S[b], copy result to S[a] if c=1. Save carry (1 if borrow) in CF. | |||
WSL | sb sa | 0101 1010 aaaa bbbb | |
Shift left S[b] one place, shifting MSB(W) into the LSB. W = S[a] = (S[b] << 1) + MSB(W). | |||
WSR | sb sa | 0101 1110 aaaa bbbb | |
Shift right S[b] one place, shifting MSB(W) into the MSB. W = S[a] = (W & maxint) + (S[b] >> 1). | |||
XOR | sa sb sa|w | 0100 c111 aaaa bbbb | |
W = S[a] XOR S[b], copy result to S[a] if c=1. |
Loops and branches are handled with skip instructions and unconditional branches. There are two basic control structures:
Variations of this are Forth-like but IF and WHILE are unconditional jumps. Use SKcc instructions to make them conditional. SKcc conditionally jams a NOP into the instruction pipeline, so it was easy to implement in hardware.
These may be nested as deeply as desired. Branches are resolved in one pass by saving temporary information on the host's data stack. BEGIN and IF leave addresses on the stack. AGAIN and THEN take those addresses and use them to form or resolve branches. You can manipulate the control stack with CSWAP, but you have to understand Forth and the assembler to do this.
If the control structures are very convoluted, you can manually figure the branch displacement and use the n BR instruction. The disassembler provides quick verification.
@ 1.1.1.1 log @Imported sources @ text @@