head 1.1; branch 1.1.1; access ; symbols start':1.1.1.1 cd16:1.1.1; locks ; strict; comment @# @; 1.1 date 2003.08.15.17.26.02; author beckert; state Exp; branches 1.1.1.1; next ; 1.1.1.1 date 2003.08.15.17.26.02; author beckert; state Exp; branches ; next ; desc @@ 1.1 log @Initial revision @ text @ CD16 Programmer's Manual

CD16 Programmer's Manual

Stack machines (Forth chips) usually fall into two categories. The first category is the Novix NC4016 and descendants. The NC4016 uses an unencoded instruction format. Various instruction groups are formatted in independent fields of bits that simultaneously control different parts of the machine, much like horizontal microcode. So while the instructions are fairly simple, several of them may be performed per cycle.

The second category is Chuck Moore's MISC (Minimal Instruction Set Computer) architecture. MISC packs several 5-bit instructions into a program word. Chuck Moore has special design tools for asynchronous CPU design, leading to MISC CPUs that are very fast. The rest of us are stuck with synchronous design tools. There are several FPGA-based CPU designs based on Chuck's work. Instructions are decoded one clock cycle at a time. With three or more instructions per program word, MISC alleviates the memory bottleneck. But in a SOC, program memory is on-chip. Without the bottleneck, it might be better to take advantage of the memory's bandwidth with more complex instructions.

The drawback with traditional stack machines is that instructions are wasted shuffling around parameters. Forth primitives operate on top of the stack to manipulate implicit parameters. But, you have to look at what these instruction sequences are doing. In an algorithm with more than a few parameters, a lot of shuffling is going on. The expense of this shuffling is hard to quantify, being dependent on the application. The CD16 is a general purpose CPU intended for moderate number crunching, so we assume that there will be a lot of parameter shuffling.

The CD16 has 4-bit stack address fields that allow operations to be performed on the stack without the prerequisite shuffling. An optimizer folds shuffling activity into the instructions used to do useful work. Parameters are divided into two groups: Stack and Global. Stack parameters address the top eight elements of the data or return stack. Global parameters address eight global registers.

The CD16 uses a semi-encoded instruction format. With the parameter address bits, it doesn't have very many bits left for the unencoded approach used by the NC4016. The ISA is divided into nine groups, each with its own format. The instruction set is a collection of useful instructions that fall out of the implementation. Other instructions are possible but not documented. For these, you can refer to the CPU's HDL code. The simulator is based on the hardware RTL model, so you are free to use undocumented instructions if you find some that are useful.

There is good support for Forth locals, which are stored in a frame on the return stack. C stack frames can be set up on the return stack, allowing a C compiler to generate reasonable code for the CD16.

Instruction Set

This is a guide to the CD16 instruction set. In case of ambiguity, the CPU's RTL model is the definitive specification.

These mnemonics are used by the assembler. The disassembler may display in either mnemonic or algebraic format. Algebraic format will usually be used, so you don't have the extra step of looking up cryptic mnemonics in the manual.

Operands:	Range:
Sa = A operand: 0..7 = top of stack, 8..F = global register 0..7.	S0 ... S7, G0 ... G7
Sb = B operand: 0..7 = top of stack, 8..F = global register 0..7.	S0 ... S7, G0 ... G7
Ra = A operand: 0..7 = top of return stack, 8..F = global 0..7.	R0 ... R7, G0 ... G7
K6 = 6-bit signed literal	-32 ... 31
Ria = 6-bit index into the return stack	S0 ... S31
Sia = 6-bit index into the data stack	R0 ... R31

Condition Codes

Condition codes for Scc and SKcc are encoded using bits 11:8.

*cccc=*	*extension*	*True when*	*Comparison*
0000	N	Never
0001	A	Always
0010	LS	W=0 or C=0	unsigned <=
0011	HI	W<>0 and C=1	unsigned >
0100	CS	C = 1	unsigned >=
0101	CC	C = 0	unsigned <
0110	EQ	W = 0	=
0111	NE	W <> 0	<>
1000	VS	V = 1
1001	VC	V = 0
1010	MI	W < 0
1011	PL	W >= 0
1100	LT	W>0 xor V=1	signed <
1101	GE	W<0 xor V=0	signed >=
1110	LE	W<=0 xor V=1	signed <=
1111	GT	W<=0 xor V=0	signed >

Standard Instructions

Name		Operands	Encoding
ADD		*sa sb sa\|w*	0100 c010 aaaa bbbb
W = S[a] + S[b], copy to S[a] if c=1. Save carry in CF
ADDC		*sa sb sa\|w*	0100 c011 aaaa bbbb
W = S[a] + S[b] + CF, copy to S[a] if c=1. Save carry in CF
ADDNC		*sa sb sa\|w*	0100 c000 aaaa bbbb
W = S[a] + S[b], copy to S[a] if c=1. Doesn't modify CF.
k	ADDRP		0000 10kk kkkk 1100
Add 6-bit signed k to RP.
k	ADDSP		0000 10kk kkkk 0100
Add 6-bit signed k to SP.
AND		*sa sb sa\|w*	0100 c101 aaaa bbbb
W = S[a] AND S[b], copy to S[a] if c=1.
ASR		*sb sa*	0101 1111 aaaa bbbb
W = S[a] = S[b] / 2. CF = carry.
n	BANK		0111 0001 01nn nnnn
Select the bank to use for extended program memory access.
*dest*	BR		0001 dddd dddd dddd
Relative branch. The destination of the branch is P+d where d is a signed 12-bit displacement.
n	COP	sb	0010 nnnn nn00 bbbb
Coprocessor S[b], instruction n.
n	COPI	sb	0010 nnnn nn10 bbbb
Coprocessor S[b], instruction n. Increment SP.
n	COPW	sb	0010 nnnn nn10 bbbb
Coprocessor S[b], instruction n. Write result to S[b].
n	COPWI	sb	0010 nnnn nn11 bbbb
Coprocessor S[b], instruction n. Write result to S[b]. Increment SP.
CALL		*dest*	1ddd dddd dddd dddd
Absolute subroutine call to 2*d. Covers 64K program space.
CRC		*sa sb*	0111 0111 aaaa bbbb
CRC step. Copies result to W. If W[n]=1 then S[b] = S[b] * 2 ^ S[a] else S[b] = S[b] * 2.
DEC		*sb sa*	0101 0011 aaaa bbbb
S[a] = S[b] - 1.
DIV1		*sb sa*	0111 0100 aaaa bbbb
Division step 1. If carry(S[a]+S[b])=1 then S[a] = S[a] + S[b]. Latch carry in CF.
DIV2		sb	0111 0101 0000 bbbb
Division step 2. W:S[b] = W:S[b] * 2 + CF.
DROP			0000 0000 0010 0000
Pop and discard top of data stack.
EXEC			0000 0100 0000 1111
Call the word at the address on top of the return stack.
IDLE			0000 0000 0000 1000
Suspend CPU operation until an interrupt request.
IJMP			0000 0100 0000 0001
Indirect jump. Pops jump address off the data stack.
IJMPD			0000 0000 0000 0001
Indirect jump. Pops jump address off the data stack. The instruction immediately following IJMPD will be executed regardless.
INC		*sb sa*	0101 0001 aaaa bbbb
W = S[a] = S[b] + 1.
INC2		*sb sa*	0101 0010 aaaa bbbb
W = S[a] = S[b] + 2.
JMP		ra	0000 0100 aaaa 0111
Jump: Swap P, S[a].
LDD		*sb sa*	0110 0100 aaaa bbbb
S[a] = pipe. Pipe = DataMem[S[b]].
LDDD		*sb sa*	0110 0111 aaaa bbbb
S[a] = pipe. Pipe = DataMem[S[b]]. S[b] = S[b] - 1.
LDDI		*sb sa*	0110 0101 aaaa bbbb
S[a] = pipe. Pipe = DataMem[S[b]]. S[b] = S[b] + 1.
n	LDDM	sb	0000 0110 0000 0110 nnnn nnnn nnnn nnnn
S[a] = DataMem[n].
LDDS		sb	0110 0110 0000 bbbb
Start data read: Pipe = DataMem[S[b]].
LDP		*sb sa*	0110 0000 aaaa bbbb
S[a] = ProgMem[S[b]].
LDPD		*sb sa*	0110 0011 aaaa bbbb
S[a] = ProgMem[S[b]]. S[b] = S[b] - 1.
LDPI		*sb sa*	0110 0001 aaaa bbbb
S[a] = ProgMem[S[b]]. S[b] = S[b] + 1.
LDPP		sb	0110 0010 0000 bbbb
Push ProgMem[S[b]] onto data stack.
LDRP		ra	0000 0000 aaaa 1010
RP = R[a].
LDSP		sa	0000 0000 aaaa 0010
SP = S[a].
LDW		sb	0111 1000 0000 bbbb
W = S[b], clear CF
n	LIT		0000 1100 0000 0110 nnnn nnnn nnnn nnnn
Push immediate data n to the data stack.
n	LITM		0000 1110 0000 0110 nnnn nnnn nnnn nnnn
Push dmem[n] to the data stack.
LSL		*sb sa*	0101 1000 aaaa bbbb
W = S[a] = S[b] << 1.
LSR		*sb sa*	0101 1100 aaaa bbbb
W = S[a] = S[b] >> 1.
MASK		*sb sa*	0111 0010 aaaa bbbb
S[a] = S[a] AND S[b]
MASKH		*sb sa*	0111 0011 aaaa bbbb
S[a] = ByteSwap(S[a]) AND S[b]
n	MOVL	sa	0000 0000 aaaa 0110 nnnn nnnn nnnn nnnn
Store immediate data n16 to W and S[a]. Two-cycle, two-cell instruction.
n	MOVLR	*ria*	0000 10aa aaaa 1110 nnnn nnnn nnnn nnnn
Store immediate data to R[a].
n	MOVLS	*sia*	0000 10aa aaaa 0110 nnnn nnnn nnnn nnnn
Store immediate data to S[a].
MOVRS		*ria sb*	0011 01aa aaaa bbbb
Move R[a] to S[b].
MOVSR		*sb sia*	0011 00aa aaaa bbbb
Move S[b] to R[a].
MOVS		*sa sb sa\|w*	0100 c100 aaaa bbbb
W = S[b], copy to S[a] if c=1.
MOVSS		*sb sa*	0101 0000 aaaa bbbb
W = S[a] = S[b]
MOVW		sb	0100 0100 0000 bbbb
W = S[b]
MOVWR		ra	0000 0000 aaaa 1101
Store W to R[a].
MOVWS		sa	0000 0000 aaaa 0101
Store W to S[a].
MUL		*sa sb*	0111 0110 aaaa bbbb
Multiplication step. If W[n]=1 then W:S[b] = W:S[b] * 2 + S[a] else W:S[b] = W:S[b] * 2
MULC0		sb	0111 1001 0000 bbbb
W = S[b]/2, clear CF ( begin multiplication )
MULC1		sb	0111 1010 0000 bbbb
W = (W+CF)*2 + B[n], clear CF ( end multiplication )
NEG		*sb sa*	0101 0101 aaaa bbbb
W = S[a] = -S[b].
NOP			0000 0000 0000 0000
No operation.
NOT		*sb sa*	0101 0100 aaaa bbbb
W = S[a] = !S[b].
OR		*sa sb sa\|w*	0100 c110 aaaa bbbb
W = S[a] OR S[b], copy result to S[a] if c=1.
POPS		sb	0011 1100 0000 bbbb
Pop S[b] from return stack.
n	PUSHI		0000 1100 0000 1110 nnnn nnnn nnnn nnnn
Push immediate data to the return stack.
n	PUSHM		0000 1110 0000 1110 nnnn nnnn nnnn nnnn
Push dmem[n] to the return stack.
PUSHS		sb	0011 1000 0000 bbbb
Push S[b] to return stack.
PUSHW			0000 0100 0000 1101
Push W to the return stack
*n-1*	REP		0111 0001 000n nnnn
Repeat next instruction n+1 times. Good for 1..32 executions.
REPW			0111 0001 0001 0000
Repeat next instruction W+1 times. Good for 1..32 executions.
RET
Return from subroutine.
RETD			0000 0000 0000 1001
Return from subroutine, delayed. The instruction immediately following RETD will be executed.
ROL		*sb sa*	0101 1011 aaaa bbbb
W = S[a] = (S[b] << 1) + MSB(S[b]).
ROLC		*sb sa*	0101 1001 aaaa bbbb
W = S[a] = (S[b] << 1) + CF. CF = carry.
RORC		*sb sa*	0101 1101 aaaa bbbb
W = S[a] = (CF << 15) + (S[b] >> 1). CF = carry.
Scc		*sa/ra*	0000 cccc aaaa r011
R=0: Store all '1's to S[a] if condition is true, else store all '0's.
R=1: Store all '1's to R[a] if condition is true, else store all '0's.
SKcc			0000 cccc 0000 0000
Skip the next instruction if condition is true. Next instruction can't be long.
SKDcc			0000 cccc 0010 0000
Skip the next instruction if condition is true. Also perform DROP operation.
STD		*sa sb*	0110 1100 aaaa bbbb
DataMem[S[b]] = S[a].
STDD		*sa sb*	0110 1111 aaaa bbbb
DataMem[S[b]] = S[a], S[b] = S[b] - 1.
STDI		*sa sb*	0110 1101 aaaa bbbb
DataMem[S[b]] = S[a], S[b] = S[b] + 1.
n	STDM		0000 0111 0000 0110 nnnn nnnn nnnn nnnn
DataMem[n] = S[a].
STP		*sa sb*	0110 1000 aaaa bbbb
ProgMem[S[b]] = S[a].
STPD		*sa sb*	0110 1011 aaaa bbbb
ProgMem[S[b]] = S[a], S[b] = S[b] - 1.
STPI		*sa sb*	0110 1001 aaaa bbbb
ProgMem[S[b]] = S[a], S[b] = S[b] + 1.
STSP		*sa/ra*	0000 0010 aaaa r101
R=0: S[a] = SP.
R=1: R[a] = RP.
SUB		*sa sb sa\|w*	0100 c001 aaaa bbbb
W = S[a] - S[b], copy result to S[a] if c=1. Save carry (1 if borrow) in CF.
WSL		*sb sa*	0101 1010 aaaa bbbb
Shift left S[b] one place, shifting MSB(W) into the LSB. W = S[a] = (S[b] << 1) + MSB(W).
WSR		*sb sa*	0101 1110 aaaa bbbb
Shift right S[b] one place, shifting MSB(W) into the MSB. W = S[a] = (W & maxint) + (S[b] >> 1).
XOR		*sa sb sa\|w*	0100 c111 aaaa bbbb
W = S[a] XOR S[b], copy result to S[a] if c=1.

Control Structures

Loops and branches are handled with skip instructions and unconditional branches. There are two basic control structures:

BEGIN ... AGAIN forms a forever loop (to BEGIN).
IF ... THEN forms a forward jump (to THEN).

Variations of this are Forth-like but IF and WHILE are unconditional jumps. Use SKcc instructions to make them conditional. SKcc conditionally jams a NOP into the instruction pipeline, so it was easy to implement in hardware.

IF not_executed ELSE executed_code THEN.
BEGIN executed_code WHILE not_executed REPEAT.

These may be nested as deeply as desired. Branches are resolved in one pass by saving temporary information on the host's data stack. BEGIN and IF leave addresses on the stack. AGAIN and THEN take those addresses and use them to form or resolve branches. You can manipulate the control stack with CSWAP, but you have to understand Forth and the assembler to do this.

If the control structures are very convoluted, you can manually figure the branch displacement and use the n BR instruction. The disassembler provides quick verification.

@ 1.1.1.1 log @Imported sources @ text @@