RPM-40 microprocessor @ 40 MHz; data from ISSCC

#3457

Author: mark@mips.COM (M
Date: Mon, 22 Feb 1988 00:47

170 lines
8890 bytes



Several articles have recently appeared, alluding to a CMOS  uP
built by General Electric, e.g. <9629@steinmetz.steinmetz.UUCP>,
<9631@steinmetz.steinmetz.UUCP>, and <375@imagine.PAWL.RPI.EDU>.

These USENET articles mention that the chip, called the "GE RPM-40",
runs a reduced instruction set, operates from 40 MHz clocks, and will
be described at ISSCC (International Solid State Circuits Conference)
on February 17th.

The paper has now been delivered and published.  The authors were
David Lewis, Theodore Wyman, Mark French, and Frederic Boericke
(no acknowledgments were presented).

Here are a few items of interest on the RPM-40, obtained from the
oral presentation and the printed digest of technical papers.  No
analysis or critique is attempted; only a dump of raw data.  The
most noticeable unknowns are marked with a double asterisk **;
perhaps others can fill in these gaps (if the data isn't secret).
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

			GE RPM-40 CMOS MICROPROCESSOR

1.  The chip was built under a DOD contract.  It is one of several
    implementations under this contract.  There are at least three:
    General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and
Texas Instruments (GaAs Bipolar).  Interestingly, they have each chosen a
different pipeline: GE == 4 stages, McDonnell == 5 stages, TI == 6 stages.


2.  The instruction set is "DARPA MIPS, core ISA (instruction set
    architecture)".  In the GE chip, instructions are 16 bits long.
    They are fetched from Instruction Memory two-at-a-time (making
32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.

Here is the summary chart of the instruction set:
***************************************************************************
*             15  14  13  12  11  10  9   8   7   6   5   4   3   2   1   *
*           +-----------------------------------------------------------+ *
* ALU       | 0   0 | i |    opcode     |     src1/dest     | src2/imm  | *
*           +-----------------------------------------------------------+ *
* COND      | 0   1 | i |     test      |        src1       | src2/imm  | *
*           +-----------------------------------------------------------+ *
* LD        | 1   0   0 | s |     dest      |      base     |  offset   | *
*           +-----------------------------------------------------------+ *
* ST        | 1   0   1 | s |    source     |      base     |  offset   | *
*           +-----------------------------------------------------------+ *
* XPLD      | 1   1   0   0 |   xp-field    |      base     |  offset   | *
*           +-----------------------------------------------------------+ *
* BRA       | 1   1   0   1 |           branch displacement             | *
*           +-----------------------------------------------------------+ *
* PFX       | 1   1   1   0 |             prefix-immediate              | *
*           +-----------------------------------------------------------+ *
* XPINS     | 1   1   1   1 |         co-processor instruction          | *
*           +-----------------------------------------------------------+ *
***************************************************************************
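
The leading bits in the chart above form a prefix code on the top four bits
of each 16-bit word.  A minimal Python sketch of that classification follows.
The function name classify() is invented here for illustration, and only the
leading bit patterns printed in the chart are used; the fields to the right of
them are not decoded.

```python
def classify(word):
    """Return the format name for a 16-bit instruction word.

    Only the leading bits shown in the summary chart are examined.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction words are 16 bits")
    top4 = word >> 12                 # bits 15..12
    if top4 >> 2 == 0b00:             # 00xx
        return "ALU"
    if top4 >> 2 == 0b01:             # 01xx
        return "COND"
    if top4 >> 1 == 0b100:            # 100x
        return "LD"
    if top4 >> 1 == 0b101:            # 101x
        return "ST"
    # Remaining patterns all begin with 11 and use the full four bits.
    return {0b1100: "XPLD", 0b1101: "BRA",
            0b1110: "PFX", 0b1111: "XPINS"}[top4]
```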


The ALU format has two register specifiers; presumably you can code
"R3 := R4 + R3"  but you cannot code "R3 := R4 + R1".

The Store format has a source register, a base register, and a 4-bit
offset field.  Loads have a dest reg, a base reg, and a 4-bit offset.

Branch instructions _seem_ to have only a 12-bit displacement field;
there doesn't appear to be a "Branch Register", "Branch And Link",
or "Conditional Branch" instruction.  Perhaps the "COND" instruction
is the conditional-skip instruction recently mentioned on the net**.

ALU ops can have a 4-bit immediate field.  If this is too small, the
"PREFIX" instruction contains a 12-bit prefix that can be concatenated
to the immediate, to create a 16-bit immediate value.  Perhaps the
PREFIX instruction can be used with loads, stores, and conditionals
too. **
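
As a rough illustration of the concatenation described above, the sketch
below joins one 12-bit prefix to a 4-bit immediate.  It assumes the prefix
supplies the upper 12 bits, which the presentation did not spell out; the
name extend_immediate() is hypothetical.

```python
def extend_immediate(prefix12, imm4):
    """Concatenate a 12-bit prefix with a 4-bit immediate into 16 bits.

    Assumption: the prefix forms the high-order bits of the result.
    """
    if not 0 <= prefix12 < (1 << 12):
        raise ValueError("prefix must fit in 12 bits")
    if not 0 <= imm4 < (1 << 4):
        raise ValueError("immediate must fit in 4 bits")
    return (prefix12 << 4) | imm4
```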

There are 21 32-bit registers; I _believe_ these are arranged as
16 general-purpose registers, plus 5 hardware stacks/queues (used in
exception processing) that are mapped into the register space. **

8-bit and 16-bit external data are converted into the internal 32-bit
format by zero-fill (unsigned) or sign-extend (signed).  This is to
fulfill the DOD requirement for byte and halfword support.  With only
a single "s" bit in the opcode it is difficult to see how these
instructions are encoded (load byte, load halfword, load word) "cross"
(signed, unsigned). **


3.  A four-stage instruction pipeline is used (except for loads, see
    below): Instruction Fetch, Instruction Decode, ALU, Writeback.
    Address calculations (branch addresses or operand addresses) are
performed in the ALU.


4.  Performance with 40 MHz clocks is 40 million native RPM-40 opcodes
    per second.  For the DOD, they benchmarked on a standard US Air Force
    mix of instructions called the `DAIS Mix'.  "The most pessimistic
value on that mix is 14 MIPS", the speaker said.


5.  The GE implementation uses a Harvard bus structure, with completely
    separate Instruction Memory and Operand Memory.  GE currently is
    using a total of 128Kbytes of memory: 16KWords of static RAM, each,
for the IMem and OMem.  Imem needs 50ns chips and Omem needs 25ns chips.
At present there is no way to increase the amount of physical memory
(e.g. with dynamic RAM).  The speaker said that the CPU chip is intended
for "embedded applications".
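
The memory figures in item 5 can be cross-checked with a little arithmetic.
The sketch below assumes 32-bit (4-byte) words, and it treats the quoted chip
speeds as if they matched one transfer period, which is only an
approximation; the names are made up for the example.

```python
WORD_BYTES = 4                  # assumption: 32-bit memory words
WORDS_PER_MEMORY = 16 * 1024    # 16 KWords each for IMem and OMem


def total_bytes(memories=2):
    """Total static RAM across the instruction and operand memories."""
    return memories * WORDS_PER_MEMORY * WORD_BYTES


def period_ns(rate_mhz):
    """Length of one transfer period, in nanoseconds, at the given rate."""
    return 1000.0 / rate_mhz


print(total_bytes() // 1024, "Kbytes")   # the 128 Kbytes quoted above
print(period_ns(20), "ns per 32-bit instruction fetch at 20 MHz")
print(period_ns(40), "ns per cycle at 40 MHz")
```

Under those assumptions the 50ns IMem parts line up with the 20 MHz paired
fetch, and the 25ns OMem parts with the 40 MHz clock.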


6.  There is a "branch target instruction cache" which consists of 32
    entries.  Each entry holds 5 instructions (10 bytes).  When a branch
    occurs, the chip looks (fully associatively) to see whether it holds
the instruction at the branch target address in its cache.  If a hit
(target instruction present) occurs, then the branch target instruction,
and the next 4 instructions, are read from the on-chip cache. Meanwhile
the off-chip Imem is readying itself to begin delivering the 6th thru Nth
instructions after the branch.  Claimed hit rates of the branch target
instruction cache are > 90%.  On a miss there is a 3-cycle latency to get
the Imem SRAM chips delivering instructions (and updating the b.t.i. cache).
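
A toy model of the branch target instruction cache in item 6, for readers
who want to experiment with it.  The entry count, entry size, and miss
latency come from the talk; the FIFO replacement policy, the class name, and
the imem callable (instruction address in, instruction out) are assumptions
made only for this sketch.

```python
from collections import OrderedDict


class BranchTargetCache:
    ENTRIES = 32            # from the talk
    INSTRS_PER_ENTRY = 5    # from the talk
    MISS_LATENCY = 3        # cycles, from the talk

    def __init__(self):
        # Maps branch target address -> tuple of 5 instructions.
        self._entries = OrderedDict()

    def lookup(self, target, imem):
        """Return (instructions, stall_cycles) for a taken branch."""
        if target in self._entries:
            return self._entries[target], 0
        # Miss: fetch the target and the next 4 instructions from Imem.
        block = tuple(imem(target + i) for i in range(self.INSTRS_PER_ENTRY))
        if len(self._entries) >= self.ENTRIES:
            self._entries.popitem(last=False)   # FIFO eviction (assumption)
        self._entries[target] = block
        return block, self.MISS_LATENCY
```

With a hit rate above 90%, this model would put the average cost below 0.3
stall cycles per taken branch.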


7.  The instruction memory contains a "lookahead counter".  This lessens
    traffic on the address bus; instruction addresses only squirt out of
    the CPU after a branch .... leisurely reloading the counter while the
branch target instruction cache supplies the 5 instructions after a branch.


8.  Loads take 7 cycles while ALU operations take 4 cycles.  If a program
    doesn't use the target register of a load until > 3 instructions after
    the load ("3 load delay slots" in some folks' parlance), then there
is no interlock and instructions are issued one per cycle.  If you use
the target register of a load <= 3 cycles later, there is a pipeline stall
while waiting for the Operand Memory to supply the data.

Stores "can" operate at "up to" 1 per cycle.  GE didn't discuss the
constraints that prevent 1 store per cycle always, nor did they compare
and contrast loads vs. stores. **


9.  Coprocessor instructions (16 bits: 4 bit "Xternal Processor Instruction"
    opcode plus 12 bit coprocessor instruction type) are passed through
    the CPU, and sent over the address bus to the coprocessor(s).  They
can be stored in the branch target address cache.  So it _appears_ that
two cycles are required to do a coprocessor op, one to communicate it
from the CPU to the coprocessor and one to do it **.  GE didn't say
whether there were architecturally-visible register files on the
coprocessors **, but there _appears_ to be an "Xternal Processor Load"
instruction **.  The Floating Point coprocessor is in fab now and is
expected out this month.


10. The CPU chip contains 92,000 transistors and is housed in a 132 pin
    package.  The design style is fully static which is helpful for
    achieving radiation-hard parts.  40 pins are inputs, 46 pins are
outputs, 32 pins are bidirectional (I/O), and there are 7 Power pins &
7 Ground pins.  No mention was made of whether this package configuration
had been "certified" to run at 40 MHz, nor what agency would perform such
certifications. **  The fab process is 1.2 micron bulk CMOS.


11. A simple virtual memory scheme called "most significant bit replacement"
    is used.  A process-id is appended to the MSB's of an address before
    sending it out of the CPU.  A special case occurs when those bits
are all-0's or all-1's.... ** **
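
A minimal sketch of what "most significant bit replacement" might look like,
going only by the name of the scheme.  The width of the process-id field was
not given, so pid_bits is a free parameter here, and the special case for
all-0's or all-1's is deliberately left unmodeled because the details are
not known.  The function name is hypothetical.

```python
def msb_replace(address, pid, pid_bits, addr_bits=32):
    """Replace the top pid_bits of address with the process id.

    Assumptions: pid_bits is unknown for the real chip, and the
    all-0's / all-1's special case is not modeled.
    """
    if not 0 < pid_bits < addr_bits:
        raise ValueError("pid_bits must be between 1 and addr_bits - 1")
    if not 0 <= pid < (1 << pid_bits):
        raise ValueError("pid does not fit in pid_bits")
    low_bits = addr_bits - pid_bits
    low_mask = (1 << low_bits) - 1
    return (pid << low_bits) | (address & low_mask)
```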

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-Mark Johnson	*** DISCLAIMER: The opinions above are personal. ***
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-991-0208
US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC

#3467

Author: bcase@apple.UUCP
Date: Mon, 22 Feb 1988 19:21

25 lines
1161 bytes

In article <1642@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>
>
>Several articles have recently appeared, alluding to a CMOS  uP
>built by General Electric, e.g. <9629@steinmetz.steinmetz.UUCP>,
><9631@steinmetz.steinmetz.UUCP>, and <375@imagine.PAWL.RPI.EDU>.
>
>Branch instructions _seem_ to have only a 12-bit displacement field;
>there doesn't appear to be a "Branch Register", "Branch And Link",
>or "Conditional Branch" instruction.  Perhaps the "COND" instruction
>is the conditional-skip instruction recently mentioned on the net**.

Allen Baum (who attended the conference) told me that the single branch
instruction is only available in the conditional form.  Thus, for
an unconditional branch, you must make sure that you know the state of
the single boolean bit (compares test a condition and set the state of
the boolean bit).

>11. A simple virtual memory scheme called "most significant bit replacement"
>    is used.  A process-id is appended to the MSB's of an address before
>    sending it out of the CPU.  A special case occurs when those bits
>are all-0's or all-1's.... ** **

Isn't this the original Stanford MIPS scheme?

Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC

#3476

Author: jesup@pawl20.paw
Date: Tue, 23 Feb 1988 08:22

166 lines
8514 bytes

In article <1642@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
=These USENET articles mention that the chip, called the "GE RPM-40",
=runs a reduced instruction set, operates from 40 MHz clocks, and will
=be described at ISSCC (International Solid State Circuits Conference)
=on February 17th.

=The
=most noticeable unknowns are marked with a double asterisk **;
=perhaps others can fill in these gaps (if the data isn't secret).

	To my knowledge, everything I say in this article is public
information.  (I was on the RPM-40 software team for 1 year, until July 87.)

=1.  The chip was built under a DOD contract.  It is one of several
=    implementations under this contract.  There are at least three:
=    General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and
=Texas Instruments (GaAs Bipolar).  Interestingly, they have each chosen a
=different pipeline: GE == 4 stages, McDonnell == 5 stages, TI == 6 stages.

	Also there's Sperry/UniSys (also CMOS).  It's not surprising that the
GaAs people use longer pipelines, they can't do much in that time, and are
restricted on transistors.

=2.  The instruction set is "DARPA MIPS, core ISA (instruction set
=    architecture)".  In the GE chip, instructions are 16 bits long.
=    They are fetched from Instruction Memory two-at-a-time (making
=32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.

	All the machines listed above are designed so that 'Core ISA' (a
generic RISC assembly language, designed by Dr Gross of CMU) can be translated
to their native assembly languages.

=The ALU format has two register specifiers; presumably you can code
="R3 := R4 + R3"  but you cannot code "R3 := R4 + R1".

	Correct, r3 = r4 + r1 becomes r3 = r4; r3 = r3 + r1.
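
The rewrite just described can be sketched as a tiny lowering step from a
three-operand add to the two-operand form.  The tuple layout (op, dest, src)
and the function name are conventions invented for this example, and the
source-register restrictions of the real chip are ignored.

```python
def lower_add(dest, a, b):
    """Lower dest := a + b into two-operand ops of the form (op, dest, src)."""
    if dest == a:
        return [("ADD", dest, b)]
    if dest == b:
        return [("ADD", dest, a)]        # addition commutes
    # General case: copy one operand into dest, then add the other.
    return [("MOV", dest, a), ("ADD", dest, b)]


print(lower_add("r3", "r4", "r1"))
print(lower_add("r3", "r4", "r3"))
```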

=The Store format has a source register, a base register, and a 4-bit
=offset field.  Loads have a dest reg, a base reg, and a 4-bit offset.

=Branch instructions _seem_ to have only a 12-bit displacement field;
=there doesn't appear to be a "Branch Register", "Branch And Link",
=or "Conditional Branch" instruction.  Perhaps the "COND" instruction
=is the conditional-skip instruction recently mentioned on the net**.

	Any of those displacements can be prefixed by PFX instruction(s)
to extend the displacement up to 32 bits.  Yes, Cond conditionally skips
the next instruction, they can be 'stacked' to provide complex conditionals.

=ALU ops can have a 4-bit immediate field.  If this is too small, the
="PREFIX" instruction contains a 12-bit prefix that can be concatenated
=to the immediate, to create a 16-bit immediate value.  Perhaps the
=PREFIX instruction can be used with loads, stores, and conditionals
=too. **

	Yes, but you can use up to 3 prefixes to get 32 bit constants (in
reality, 32 bits are not used very often.)

=There are 21 32-bit registers; I _believe_ these are arranged as
=16 general-purpose registers, plus 5 hardware stacks/queues (used in
=exception processing) that are mapped into the register space. **

	Minor error, there are 21 gp registers, plus a number of special
purpose registers, mostly reserved to supervisor mode.  Several are stacks
for internal state mapped into register slots.  User available registers
are the PC, Trap register, sr2 (has various flags), and the Size register
(determines the size of non-word LD/ST, allows some register remapping,
and a bit for doing 16-bit overflow detection instead of 32).

=8-bit and 16-bit external data are converted into the internal 32-bit
=format by zero-fill (unsigned) or sign-extend (signed).  This is to
=fulfill the DOD requirement for byte and halfword support.  With only
=a single "s" bit in the opcode it is difficult to see how these
=instructions are encoded (load byte, load halfword, load word) "cross"
=(signed, unsigned). **

	There are state bits in the size register that control some of
this.  The 's' bit specifies "load word" or "load not word" (type defined
by size bits, usually you're only playing with one non-word type).

=4.  Performance with 40 MHz clocks is 40 million native RPM-40 opcodes
=    per second.  For the DOD, they benchmarked on a standard US Air Force
=    mix of instructions called the `DAIS Mix'.  "The most pessimistic
=value on that mix is 14 MIPS", the speaker said.

	DAIS is a 1750a (Air Force Standard CPU) mix of instructions, the
DAIS timings are heavily FPU dependent and are in 1750a MIPS, not RPM-40!

=5.  The GE implementation uses a Harvard bus structure, with completely
=    separate Instruction Memory and Operand Memory.  GE currently is
=    using a total of 128Kbytes of memory: 16KWords of static RAM, each,
=for the IMem and OMem.  Imem needs 50ns chips and Omem needs 25ns chips.
=At present there is no way to increase the amount of physical memory
=(e.g. with dynamic RAM).  The speaker said that the CPU chip is intended
=for "embedded applications".

	Well.... The current board has 128K, but the CPU supports full
32-bit addressing.  Nothing says you can't put more than 128K on it, or use
some sort of external cache.  The only limits are the amount of capacitance
the CPU can drive at 40 Mhz.

=8.  Loads take 7 cycles while ALU operations take 4 cycles.  If a program
=    doesn't use the target register of a load until > 3 instructions after
=    the load ("3 load delay slots" in some folks' parlance), then there
=is no interlock and instructions are issued one per cycle.  If you use
=the target register of a load <= 3 cycles later, there is a pipeline stall
=while waiting for the Operand Memory to supply the data.

	That is only a software stall, eg NOP-insertion.  Of course, the
reorganizer will try to fill it.  Note that the 7 & 4 cycle figures include
all pipe stages, including the illusionary IF stage.
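
To make the NOP-insertion idea concrete, here is a naive sketch of such a
pass.  It takes the figure of 3 load delay slots from the original post and
represents each instruction as (op, dest, sources); that representation and
the function name are invented here.  A real reorganizer would try to move
useful instructions into the slots rather than pad with NOPs.

```python
LOAD_DELAY_SLOTS = 3
NOP = ("NOP", None, ())


def insert_nops(program):
    """Pad with NOPs so no instruction reads a load's target too early."""
    out = []
    load_index = {}   # register -> position in `out` of the load writing it
    for op, dest, srcs in program:
        for reg in srcs:
            if reg in load_index:
                # The first safe position is 3 slots past the load.
                earliest = load_index[reg] + LOAD_DELAY_SLOTS + 1
                while len(out) < earliest:
                    out.append(NOP)
        out.append((op, dest, srcs))
        if op == "LD":
            load_index[dest] = len(out) - 1
    return out
```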

=Stores "can" operate at "up to" 1 per cycle.  GE didn't discuss the
=constraints that prevent 1 store per cycle always, nor did they compare
=and contrast loads vs. stores. **

	There are some interlocks with other address-bus using instructions.
You can string as many stores in a row as you want, or as many loads.

=9.  Coprocessor instructions (16 bits: 4 bit "Xternal Processor Instruction"
=    opcode plus 12 bit coprocessor instruction type) are passed through
=    the CPU, and sent over the address bus to the coprocessor(s).  They
=can be stored in the branch target address cache.  So it _appears_ that
=two cycles are required to do a coprocessor op, one to communicate it
=from the CPU to the coprocessor and one to do it **.  GE didn't say
=whether there were architecturally-visible register files on the
=coprocessors **, but there _appears_ to be an "Xternal Processor Load"
=instruction **.  The Floating Point coprocessor is in fab now and is
=expected out this month.

	The CPU doesn't have to wait, it just issues the instruction over
the address bus.  There is an XPLoad instruction, coprocessor dependent.

=11. A simple virtual memory scheme called "most significant bit replacement"
=    is used.  A process-id is appended to the MSB's of an address before
=    sending it out of the CPU.  A special case occurs when those bits
=are all-0's or all-1's.... ** **

	Tasks can be allocated memory under this scheme in power-of-two
sized chunks == 256 bytes.  Of course, instructions and data have different
mappings.

=++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
=-Mark Johnson	*** DISCLAIMER: The opinions above are personal. ***
=UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-991-0208
=US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

	I hate to admit this, but it was decided that Core ISA mandated
little-endian memory layout, since several other Core ISA users had implemented
their CPUs that way already when we questioned it.  (Will little-endianism
dog our heels forever? :-)

	VERY rough figures is 1 rpm-40 @ 40Mhz is about equal to 7-9
16Mhz 68020's with 0 wait-state memory and no MMU delay.  (Not your standard
unix box environment 68020.)

{ WARNING:  this is VERY ROUGH, and though I have calculations available that
            say this, they are very back-of-napkin style!  However, it's
	    probably not TOO far off.  Maybe we'll have real performance
	    figures at some point from GE (I don't work there anymore). }

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC

#3482

Author: oconnor@sunset.s
Date: Tue, 23 Feb 1988 16:06

39 lines
1739 bytes

An article by bcase@apple.UUCP (Brian Case) says:
] In article <1642@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
] >
] >
] >Several articles have recently appeared, alluding to a CMOS  uP
] >built by General Electric, e.g. <9629@steinmetz.steinmetz.UUCP>,
] ><9631@steinmetz.steinmetz.UUCP>, and <375@imagine.PAWL.RPI.EDU>.
] >
] >Branch instructions _seem_ to have only a 12-bit displacement field;
] >there doesn't appear to be a "Branch Register", "Branch And Link",
] >or "Conditional Branch" instruction.  Perhaps the "COND" instruction
] >is the conditional-skip instruction recently mentioned on the net**.
]
] Allen Baum (who attended the conference) told me that the single branch
] instruction is only available in the conditional form.  Thus, for
] an unconditional branch, you must make sure that you know the state of
] the single boolean bit (compares test a condition and set the state of
] the boolean bit).

Allen Baum has misinterpreted. Branches are conditional-ized just
like any other instruction (except PREFIX). If and only if the branch
(and its PREFIXes, if any) is preceded by one or more COND
instructions (and their PREFIXes, if any) is the branch conditional.

] >11. A simple virtual memory scheme called "most significant bit replacement"
] >    is used.  A process-id is appended to the MSB's of an address before
] >    sending it out of the CPU.  A special case occurs when those bits
] >are all-0's or all-1's.... ** **
]
] Isn't this the original Stanford MIPS scheme?

It's an enhancement of the original Stanford scheme.


--
	Dennis O'Connor 	oconnor@sunset.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
    "Nuclear War is NOT the worst thing people can do to this planet."

Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC

#3495

Author: mash@mips.COM (J
Date: Wed, 24 Feb 1988 05:20

37 lines
1819 bytes

In article <409@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
..
>=or "Conditional Branch" instruction.  Perhaps the "COND" instruction
>=is the conditional-skip instruction recently mentioned on the net**.

>	Any of those displacements can be prefixed by PFX instruction(s)
>to extend the displacement up to 32 bits.  Yes, Cond conditionally skips
>the next instruction, they can be 'stacked' to provide complex conditionals.

I assume that cond skips the next instruction, including the PFX's??

>	Minor error, there are 21 gp registers, plus a number of special
>purpose registers, mostly reserved to supervisor mode.  Several are stacks
>for internal state mapped into register slots.  User available registers
>are the PC, Trap register, sr2 (has various flags), and the Size register
>(determines the size of non-word LD/ST, allows some register remapping,
>and a bit for doing 16-bit overflow detection instead of 32).

How do you address more than 16 gp regs, given the encoding?

>	VERY rough figures is 1 rpm-40 @ 40Mhz is about equal to 7-9
>16Mhz 68020's with 0 wait-state memory and no MMU delay.  (Not your standard
>unix box environment 68020.)

I.e., assuming that such 68020s are around 2 vax-mips, this sounds like
about 14-18 vax-mips, roughly.

>{ WARNING:  this is VERY ROUGH, and though I have calculations available that
>            say this, they are very back-of-napkin style!  However, it's
>	    probably not TOO far off.  Maybe we'll have real performance
>	    figures at some point from GE (I don't work there anymore). }
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Re: RPM-40 microprocessor @ 40 MHz; dat

#3502

Author: aglew@ccvaxa.UUC
Date: Wed, 24 Feb 1988 17:23

16 lines
762 bytes


>=2.  The instruction set is "DARPA MIPS, core ISA (instruction set
>=    architecture)".  In the GE chip, instructions are 16 bits long.
>=    They are fetched from Instruction Memory two-at-a-time (making
>=32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.
>
>	All the machines listed above are designed so that 'Core ISA' (a
>generic RISC assembly language, designed by Dr Gross of CMU) can be translated
>to their native assembly languages.

Okay, what about this MIPS-like ISA? Will it be assembly language only,
or binary? Will it be possible to run some form of program intermediate
between C and actual assembly through a translator to move between these
families - and will third party software vendors distribute that portable
form?

Re: RPM-40 microprocessor @ 40 MHz; dat

#3532

Author: aglew@ccvaxa.UUC
Date: Fri, 26 Feb 1988 16:52

35 lines
1553 bytes


.> Prefix instructions in the GE RPM-40

I like this idea.

(I should - I used it in a school project back in '84, before I knew details
of the Transputer - I think I got it from an earlier architecture, melded
with the 8088's PREFIX instructions.)

I particularly like how it begins to let the instruction set get independent
of the register size (so long as people do not expect 1<<32 == 0)

A question, though: how would you compare PREFIX to an instruction SHIFT and
OR --  SHOR r,lit ::== r := (r<<14)|lit? PREFIX always seems to eventually
require a specification for one of several literal fields it is extending,
plus it requires state to be saved on interrupts, which leans towards
assembling the constant in a register.
    On the other hand, you can always build a decoder that never puts prefix
into a register at all, but takes prefix and the prefixed instruction as
one packet. This is nice, and makes it a pity to require the register write.

What do people (particularly the RPM-40 people) feel on this?



Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801
    aglew@gould.com     	- preferred, if you have nameserver
    aglew@gswd-vms.gould.com    - if you don't
    aglew@gswd-vms.arpa 	- if you use DoD hosttable
    aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?

My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC

#3560

Author: jesup@pawl3.pawl
Date: Mon, 29 Feb 1988 09:29

63 lines
2845 bytes

In article <1666@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>In article <409@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
>>	Any of those displacements can be prefixed by PFX instruction(s)
>>to extend the displacement up to 32 bits.  Yes, Cond conditionally skips
>>the next instruction, they can be 'stacked' to provide complex conditionals.

>I assume that cond skips the next instruction, including the PFX's??

	Yup.  The COND instruction actually skips the next (non-PFX,non-COND)
instruction.  Essentially, it acts as though PFX is part of the instruction
after it.  Example:

		COND GT,.r1,.r2
		PFX  #$xxx
		COND GE,.r1,#$yy
		PFX  #$qqq
		ADD  .r1,#zz
		MOV  .r2,.r1

If either COND fails, control goes to the MOV instruction.  Of course,
you wouldn't write the PFX's in yourself; the assembler does them for you
automagically.
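
The skip rule can be mimicked with a few lines of Python.  This is only a
behavioral sketch of the description above: opcodes are plain strings,
cond_outcomes (a hypothetical dict from instruction index to True/False)
stands in for the actual comparisons, and branches are not modeled.

```python
def executed_indices(program, cond_outcomes):
    """Return the indices of ordinary instructions that actually execute.

    program: list of opcode strings, e.g. ["COND", "PFX", "ADD"].
    cond_outcomes: dict mapping the index of each COND to True (test
    passed) or False (test failed).
    """
    executed = []
    skipping = False
    for i, op in enumerate(program):
        if op == "PFX":
            continue          # treated as part of the instruction after it
        if op == "COND":
            if not skipping and not cond_outcomes[i]:
                skipping = True
            continue
        if skipping:
            skipping = False  # this is the one instruction that is skipped
        else:
            executed.append(i)
    return executed


example = ["COND", "PFX", "COND", "PFX", "ADD", "MOV"]
print(executed_indices(example, {0: True, 2: True}))    # ADD and MOV run
print(executed_indices(example, {0: True, 2: False}))   # only MOV runs
```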

>>	Minor error, there are 21 gp registers, plus a number of special
>>purpose registers, mostly reserved to supervisor mode.  Several are stacks
>>for internal state mapped into register slots.  User available registers
>>are the PC, Trap register, sr2 (has various flags), and the Size register
>>(determines the size of non-word LD/ST, allows some register remapping,
>>and a bit for doing 16-bit overflow detection instead of 32).
>
>How do you address more than 16 gp regs, given the encoding?

	In general, the destination of ALU ops can be any register 0-31.
However, for most ALU ops the source must be in regs 0-15.  There are
two ways around this:
	1)  There are two instructions that reverse the meanings of "source"
	    and "destination".  These are RMOV (reverse move) and RADD (reverse
	    add).  These allow moving the higher registers to the lower
	    or adding them into the lower (two high-frequency ops).
	2)  There is a bit that allows swapping of the regs 8-13 and regs
	    16-21.
Note that loads and stores also must only use regs 0-15.

	There is no guarantee the higher registers will be extremely useful,
but they are very useful for things like temps, or passing args, or
accumulators, etc.  The swap feature can make them much more useful, but
requires more work to use.
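
For illustration, the swap bit could be pictured as a remapping of register
numbers like the one below.  The one-to-one pairing (8 with 16, 9 with 17,
and so on) is an assumption, as is the function name; only the two ranges
come from the description above.

```python
def physical_register(number, swap):
    """Map a register number to the register actually accessed.

    Assumption: with the swap bit set, regs 8-13 and 16-21 trade places
    pairwise (8<->16, ..., 13<->21); everything else is unchanged.
    """
    if swap:
        if 8 <= number <= 13:
            return number + 8
        if 16 <= number <= 21:
            return number - 8
    return number
```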

>>	VERY rough figures is 1 rpm-40 @ 40Mhz is about equal to 7-9
>>16Mhz 68020's with 0 wait-state memory and no MMU delay.  (Not your standard
>>unix box environment 68020.)
>
>I.e., assuming that such 68020s are around 2 vax-mips, this sounds like
>about 14-18 vax-mips, roughly.

	That seems to jibe fairly well.  Of course, only real benchmarks
will tell the story, and those depend on compiler tech quite a bit.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

Re: RPM-40 microprocessor @ 40 MHz; dat

#3562

Author: jesup@pawl3.pawl
Date: Mon, 29 Feb 1988 09:46

27 lines
1380 bytes

In article <28200110@ccvaxa> aglew@ccvaxa.UUCP writes:
>>	All the machines listed above are designed so that 'Core ISA' (a
>>generic RISC assembly language, designed by Dr Gross of CMU) can be translated
>>to their native assembly languages.
>
>Okay, what about this MIPS-like ISA? Will it be assembly language only,
>or binary? Will it be possible to run some form of program intermediate
>between C and actual assembly through a translator to move between these
>families - and will third party software vendors distribute that portable
>form?

	Core ISA is an assembly language for a nonexistent machine.  It
is fairly 'RISCy', but includes things like multiply (integer and FP) as
single ops, etc.  It has no relation to ANY existing hardware at all, and was
designed explicitly for the Darpa MIPS project.

	Anything distributed in Core ISA is portable (at least potentially).
All the machines mentioned have Core_ISA->their_assembler translators.
However, I suspect most stuff will be distributed in source (the compilers
produce Core ISA, that's the point of it).  Assembler modules should all be
written in Core as well.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

Re: RPM-40 microprocessor @ 40 MHz; dat

#3563

Author: jesup@pawl3.pawl
Date: Mon, 29 Feb 1988 09:58

35 lines
1746 bytes

In article <28200112@ccvaxa> aglew@ccvaxa.UUCP writes:
>..> Prefix instructions in the GE RPM-40

>A question, though: how would you compare PREFIX to an instruction SHIFT and
>OR --  SHOR r,lit ::== r := (r<<14)|lit? PREFIX always seems to eventually
>require a specification for one of several literal fields it is extending,
>plus it requires state to be saved on interrupts, which leans towards
>assembling the constant in a register.

	Pipelining!  You can't use the result of an op in the next
instruction!  So you'd have to devote both a register AND intersperse NOPs
between SHORs.  However, on a machine with loopback of ALU results (may
slow things down) it only costs a register, so it doesn't hurt TOO much
(if you have registers to spare, which you very well might not).

	What are these 'several fields' you refer to?  RPM-40 can only have
1 value that might be extended via prefix in any instruction (immediates
for ALU and COND ops, offset for load/store/branch, xp instruction field
for XPINST, etc.)

>    On the other hand, you can always build a decoder that never puts prefix
>into a register at all, but takes prefix and the prefixed instruction as
>one packet. This is nice, and makes it a pity to require the register write.

	RPM-40 does that now, but handles each prefix as it comes along
(there are some hidden resources being used).  What you imply would complicate
the decoder a lot.

>Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

Re: RPM-40 microprocessor @ 40 MHz; dat

#3578

Author: oconnor@sunset.s
Date: Mon, 29 Feb 1988 18:04

48 lines
2047 bytes

An article by aglew@ccvaxa.UUCP says:
]
] ..> Prefix instructions in the GE RPM-40
]
] I like this idea.
] [...]
] I particularly like how it begins to let the instruction set get independent
] of the register size (so long as people do not expect 1<<32 == 0)
]
] A question, though: how would you compare PREFIX to an instruction SHIFT and
] OR --  SHOR r,lit ::== r := (r<<14)|lit?

PREFIX builds immediate values that can then be added, ORed,
subtracted or whatever to anything you like. It does not use
a user register to do this (minor win). And it does NOT access
the register file, or use the ALU. In a pipelined system
this is significant: PREFIX as implemented in RPM40 has no latency
problems (major win). SHOR would have latency problems.
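
To make the SHOR side of the comparison concrete, here is a minimal sketch in Python. It models SHOR r,lit as r := (r<<14)|lit, exactly as in the quoted question, and builds a constant from 14-bit pieces. The function names, the 32-bit register width, the cleared starting register, and the "one NOP between dependent SHORs when there is no ALU bypass" rule are all assumptions made for illustration, not statements about the RPM-40.

```python
LIT_BITS = 14   # literal width, taken from the quoted SHOR definition
REG_BITS = 32   # assumed register width for this sketch

def shor(r, lit):
    """One SHOR step: r := (r << 14) | lit, truncated to the register width."""
    assert 0 <= lit < (1 << LIT_BITS)
    return ((r << LIT_BITS) | lit) & ((1 << REG_BITS) - 1)

def split_literals(value):
    """Split a constant into 14-bit pieces, most significant piece first."""
    pieces = []
    while True:
        pieces.append(value & ((1 << LIT_BITS) - 1))
        value >>= LIT_BITS
        if value == 0:
            break
    return list(reversed(pieces))

def build_constant(value):
    """Rebuild `value` with a chain of SHORs, starting from a cleared register."""
    r = 0
    for lit in split_literals(value):
        r = shor(r, lit)
    return r

def issue_slots(n_shors, bypass):
    """Instruction slots for a chain of dependent SHORs.

    Assumption: without bypass, one NOP is needed between each dependent pair.
    """
    nops = 0 if bypass else max(n_shors - 1, 0)
    return n_shors + nops
```

Under these assumptions a full 32-bit constant such as 0xDEADBEEF needs three SHORs, which occupy 3 slots with bypass and 5 without. Each SHOR reads the register written by the one before it, which is the dependency the latency argument is about.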

] PREFIX always seems to eventually
] require a specification for one of several literal fields it is extending,
] plus it requires state to be saved on interrupts, which leans towards
] assembling the constant in a register.

RPM40 instructions only have one field that can possibly be an
immediate operand; why more? Any operations on two constants should
be done at compile or load time, I think. Given you have a
reverse-subtract instruction (normal = op1-op2, reverse = op2-op1),
I don't see the need for two "immediate-able" fields.

Yes, the prefix register needs to be saved on a context switch, and in
fact has to have an old value saved. This is not really a big deal.

]     On the other hand, you can always build a decoder that never puts prefix
] into a register at all, but takes prefix and the prefixed instruction as
] one packet. This is nice, and makes it a pity to require the register write.

This is a good idea, especially if you can fetch instructions faster
than you can execute (non-PREFIX) instructions.

] Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801


--
    Dennis O'Connor			      UUNET!steinmetz!sunset!oconnor
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

Re: RPM-40 microprocessor @ 40 MHz; dat

#3603

Author: mash@mips.COM (J
Date: Tue, 01 Mar 1988 09:02

26 lines
1183 bytes

In article <9727@steinmetz.steinmetz.UUCP> sunset!oconnor@steinmetz.UUCP writes:
..
>] A question, though: how would you compare PREFIX to an instruction SHIFT and
>] OR --  SHOR r,lit ::== r := (r<<14)|lit?
>
>PREFIX builds immediate values that can then be added, ORed,
>subtracted or whatever to anything you like. It does not use
>a user register to do this (minor win). And it does NOT access
>the register file, or use the ALU. In a pipelined system
>this is significant: PREFIX as implemented in RPM40 has no latency
>problems (major win). SHOR would have latency problems.

Why would it have latency problems? None of the popular RISCs have
latency problems with r = r op literal for the usual ops.
I.e., any high-performance system is likely to make use of
register-bypassing anyway, so that:
	r = r op literal
	r = r op r
has zero intervening latency (the performance penalty of a
cycle's latency for such things is large).
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Re: RPM-40 microprocessor @ 40 MHz; dat

#3610

Author: aglew@ccvaxa.UUC
Date: Tue, 01 Mar 1988 16:23

10 lines
515 bytes


>	Ever seen a multiply or divide as 1 instruction in a RISC?  No, of
>course they are not there.  No direct support on CPU for them either.  I will
>say more on this issue when the FPU is formally announced.  You can do them
>in the CPU in software if you want, takes a few cycles though.

If your customers spend time doing multiplies or divides, then your RISC
designer will put them in. Cray is the only "RISCy" machine that is widely
known with multiply that springs to mind, though. Same for floating point.

Re: RPM-40 microprocessor @ 40 MHz; dat

#3616

Author: davidsen@steinme
Date: Tue, 01 Mar 1988 17:35

22 lines
1009 bytes

In article <444@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
| [...]
| 	Core ISA is an assembly language for a nonexistent machine.  It
| is fairly 'RISCy', but includes things like multiply (integer and FP) as
| single ops, etc.  It has no relation to ANY existing hardware at all, and was
| designed explicitly for the DARPA MIPS project.
|
| 	Anything distributed in Core ISA is portable (at least potentially).
| All the machines mentioned have Core_ISA->their_assembler translators.

If it clarifies the situation, ISA is functionally similar to the old
UCSD P-system, and I don't see any technical reason why it couldn't be
interpreted instead of translated and compiled.

For history buffs, the original "B" language compiler compiled to P-code,
which was then used to generate assembler. We had a P-code interpreter
on several machines.
--
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

Re: RPM-40 microprocessor @ 40 MHz; dat

#3626

Author: tim@amdcad.AMD.C
Date: Tue, 01 Mar 1988 20:18

19 lines
795 bytes

In article <445@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
| 	Pipelining!  You can't use the result of an op in the next
| instruction!  So you'd have to devote both a register AND intersperse NOPs
| between SHORs.  However, on a machine with loopback of ALU results (may
| slow things down) it only costs a register, so it doesn't hurt TOO much
| (if you have registers to spare, which you very well might not).

Interesting... this is the first RISC processor I have heard of that did
not implement operand {forwarding/bypassing/other names?} around the
ALU.  What prompted the elimination of this feature?  Do you have any
statistics on how many additional nops/stalls are required?

Thanks for any info...


	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

Re: RPM-40 microprocessor @ 40 MHz; dat

#3690

Author: bron@olympus.SGI
Date: Thu, 03 Mar 1988 21:34

13 lines
640 bytes

In article <28200116@ccvaxa>, aglew@ccvaxa.UUCP writes:
> If your customers spend time doing multiplies or divides, then your RISC
> designer will put them in. Cray is the only "RISCy" machine that is widely
> known with multiply that springs to mind, though. Same for floating point.

FYI, the Cray XMP machines do NOT have hardware support for a general
integer (64-bit) multiply.  They can do address-length (24-bit) integer
multiplies.  They have no hardware for integer divide (of any length).
If you need these operations, you have to convert to floating point.
------
Bron Nelson   bron@sgi.com
Don't blame my employers for my opinions.

Re: RPM-40 microprocessor @ 40 MHz; dat

#3701

Author: wcs@ho95e.ATT.CO
Date: Fri, 04 Mar 1988 05:22

28 lines
1501 bytes

In article <28200116@ccvaxa> aglew@ccvaxa.UUCP writes:
:
:>	Ever seen a multiply or divide as 1 instruction in a RISC?  No, of
:>course they are not there.  No direct support on CPU for them either.  I will
:>say more on this issue when the FPU is formally announced.  You can do them
:>in the CPU in software if you want, takes a few cycles though.
:
:If your customers spend time doing multiplies or divides, then your RISC
:designer will put them in. Cray is the only "RISCy" machine that is widely
:known with multiply that springs to mind, though. Same for floating point.

The AT&T Digital Signal Processor chips are RISCy, and do single-instruction
multiplies, because that's what the chips' customers do.  The DSP-32 does
32-bit floating point - each cycle does an add and a multiply if you want them,
and/or 16-bit integer ops; I think the pipeline is 4 deep for multiplies.
The original chip did 4 Million cycles/sec (16MHz clock?); the current version
does 6 Million.  The next generation will be faster.  The current chip also
includes serial and parallel I/O hardware, but only 64K address space;
the next will be more general.

The DSP-16 does 16-bit integers (multiplies into 36 bits); it's got very
limited memory (1-4K on chip), and has a more limited instruction set, but the
16 - 19 million cycles/sec do a multiply and/or add as well as separate integer
ops for address calculation.
--
#				Thanks;
# Bill Stewart, AT&T Bell Labs 2G218, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wcs

Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC

#3703

Author: kers@otter.hple.
Date: Fri, 04 Mar 1988 08:47

18 lines
754 bytes

Version 2 of the Acorn RISC Machine has two multiply instructions (one with,
one without, accumulate), but no divide instruction.

At a seminar I attended, the designer* said that (a) they could fit it on the
chip, and (b) it afforded enough performance increase to be an acceptable
overhead (rather than having a multiply-step, or doing it with shift-and-add).

Mildly surprising, considering the shiftable-register-source in the data
manipulation instructions (gives you multiplies by constants of the form 2^n,
2^n+1, 2^n-1 in one instruction). Could it be something to do with having
interpreted BBC Basic as a principal language, so there isn't a compiler to
notice that E*K can be done speedily?
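
The shifted-source trick rests on three identities: a multiply by 2^n is a shift, a multiply by 2^n+1 is a shift plus an add, and a multiply by 2^n-1 is a shift followed by a reverse subtract. A small Python sketch of the arithmetic only, assuming 32-bit registers; the function names are made up for the example and do not model any particular instruction encoding.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit register width

def times_pow2(x, n):
    """x * 2**n: a move with the source operand shifted left by n."""
    return (x << n) & MASK32

def times_pow2_plus1(x, n):
    """x * (2**n + 1): x added to (x shifted left by n)."""
    return (x + (x << n)) & MASK32

def times_pow2_minus1(x, n):
    """x * (2**n - 1): reverse subtract, (x shifted left by n) minus x."""
    return ((x << n) - x) & MASK32
```

For x = 7 and n = 3 these give 56, 63, and 49, matching 7*8, 7*9, and 7*7. Spotting that a constant K has one of these forms is compiler work, which is the point of the question about an interpreted language.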

Regards,
Kers.

* well, one of the designers.

Re: RPM-40 microprocessor @ 40 MHz; dat

#3730

Author: jesup@pawl23.paw
Date: Sat, 05 Mar 1988 07:52

37 lines
1845 bytes

In article <1729@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>I.e., any high-performance system is likely to make use of
>register-bypassing anyway, so that:
>	r = r op literal
>	r = r op r
>has zero intervening latency (the performance penalty of a
>cycle's latency for such things is large).

>-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

	Two reasons why one might not have register bypassing:

1)  Slows down critical path.  Any finely tuned RISC CPU will most probably
have its cycle time determined by the latency through the ALU.  Using a
loopback of ALU results might result (depending on layout, tech, etc) in up
to a 20% slowdown in the ALU, plus increase the chip area and layout
problems.  This doesn't mean a loopback is a loss necessarily, but that it
does have a measurable cost which must be weighed against the benefits.

2)  In combination with (1) above, I'm not sure that having a one-cycle delay
in ALU results causes any large loss.  A good reorganizer can fill those
latencies, or move the ALU op forward into, for example, a load delay.  In
high-speed (> 15 MHz) RISCs (and maybe slower ones as well), load delays
are usually the determining factor, or a large part of it.  What studies do you
have that compare RISCs with 1-cycle ALU delays and 0-cycle ones?  I'd like to
see anything you can drag up.

3)  If one is doing much FP, the CPU is usually waiting on results from the
FPU anyway, so you may not lose anything.  (I know I said 2, but....)
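
The trade-off in points 1 and 2 can be put into a back-of-envelope model, sketched here in Python. Everything in it is an assumption for illustration: the 25 ns base cycle (40 MHz), the use of the full 20% as the bypass penalty (the text above says "up to"), the rule that each unfilled dependency costs exactly one NOP, and the function and parameter names. It is not based on measurements.

```python
def run_time(n_instr, dep_fraction, bypass, cycle_ns=25.0, bypass_slowdown=0.20):
    """Estimated run time in ns for n_instr useful instructions.

    dep_fraction: fraction of instructions that need the previous ALU result
                  and whose delay slot the reorganizer could not fill.
    """
    if bypass:
        # Bypass: no extra cycles, but every cycle is assumed to be longer.
        return n_instr * cycle_ns * (1.0 + bypass_slowdown)
    # No bypass: each unfilled dependency costs one extra cycle (a NOP).
    return n_instr * (1.0 + dep_fraction) * cycle_ns

def break_even_fraction(bypass_slowdown=0.20):
    """Unfilled-dependency fraction at which both designs take the same time.

    Setting (1 + f) * T equal to (1 + s) * T gives f = s.
    """
    return bypass_slowdown
```

With these numbers the design without bypass wins as long as the reorganizer leaves fewer than one instruction in five waiting on the previous ALU result. If the real slowdown from the loopback is well under 20%, the break-even point drops accordingly.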

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

Re: RPM-40 microprocessor @ 40 MHz; dat

#3817

Author: aglew@ccvaxa.UUC
Date: Fri, 11 Mar 1988 16:11

9 lines
252 bytes


>Which is not the standard Unix
>profile (except for things like Crays).

Ayoi! You don't have to buy a multimillion dollar
supercomputer to get a floating point oriented
system that runs UNIX. Consider Gould (and, to be fair,
Alliant, Convex, etc.)
