Thread View: comp.arch
20 messages
Started by mark@mips.COM (M
Mon, 22 Feb 1988 00:47
RPM-40 microprocessor @ 40 MHz; data from ISSCC
Author: mark@mips.COM (M
Date: Mon, 22 Feb 1988 00:47
170 lines
8890 bytes
Several articles have recently appeared, alluding to a CMOS uP built by General Electric, e.g. <9629@steinmetz.steinmetz.UUCP>, <9631@steinmetz.steinmetz.UUCP>, and <375@imagine.PAWL.RPI.EDU>. These USENET articles mention that the chip, called the "GE RPM-40", runs a reduced instruction set, operates from 40 MHz clocks, and will be described at ISSCC (International Solid State Circuits Conference) on February 17th.

The paper has now been delivered and published. The authors were David Lewis, Theodore Wyman, Mark French, and Frederic Boericke (no acknowledgments were presented). Here are a few items of interest on the RPM-40, obtained from the oral presentation and the printed digest of technical papers. No analysis or critique is attempted; only a dump of raw data. The most noticeable unknowns are marked with a double asterisk **; perhaps others can fill in these gaps (if the data isn't secret).

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
GE RPM-40 CMOS MICROPROCESSOR

1. The chip was built under a DOD contract. It is one of several implementations under this contract. There are at least three: General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and Texas Instruments (GaAs Bipolar). Interestingly, they have each chosen a different pipeline: GE == 4 stages, McDonnell == 5 stages, TI == 6 stages.

2. The instruction set is "DARPA MIPS, core ISA (instruction set architecture)". In the GE chip, instructions are 16 bits long. They are fetched from Instruction Memory two-at-a-time (making 32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.
Here is the summary chart of the instruction set:

        15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
       +-----------------------------------------------+
ALU    | 0 0 | i |   opcode    | src1/dest | src2/imm  |
       +-----------------------------------------------+
COND   | 0 1 | i |    test     |   src1    | src2/imm  |
       +-----------------------------------------------+
LD     | 1 0 0 | s |   dest    |   base    |  offset   |
       +-----------------------------------------------+
ST     | 1 0 1 | s |  source   |   base    |  offset   |
       +-----------------------------------------------+
XPLD   | 1 1 0 0 |  xp-field   |   base    |  offset   |
       +-----------------------------------------------+
BRA    | 1 1 0 1 |         branch displacement         |
       +-----------------------------------------------+
PFX    | 1 1 1 0 |          prefix-immediate           |
       +-----------------------------------------------+
XPINS  | 1 1 1 1 |      co-processor instruction       |
       +-----------------------------------------------+

The ALU format has two register specifiers; presumably you can code "R3 := R4 + R3" but you cannot code "R3 := R4 + R1". The Store format has a source register, a base register, and a 4-bit offset field. Loads have a dest reg, a base reg, and a 4-bit offset.

Branch instructions _seem_ to have only a 12-bit displacement field; there doesn't appear to be a "Branch Register", "Branch And Link", or "Conditional Branch" instruction. Perhaps the "COND" instruction is the conditional-skip instruction recently mentioned on the net**.

ALU ops can have a 4-bit immediate field. If this is too small, the "PREFIX" instruction contains a 12-bit prefix that can be concatenated to the immediate, to create a 16-bit immediate value.
Perhaps the PREFIX instruction can be used with loads, stores, and conditionals too. **

There are 21 32-bit registers; I _believe_ these are arranged as 16 general-purpose registers, plus 5 hardware stacks/queues (used in exception processing) that are mapped into the register space. **

8-bit and 16-bit external data are converted into the internal 32-bit format by zero-fill (unsigned) or sign-extend (signed). This is to fulfill the DOD requirement for byte and halfword support. With only a single "s" bit in the opcode it is difficult to see how these instructions are encoded (load byte, load halfword, load word) "cross" (signed, unsigned). **

3. A four-stage instruction pipeline is used (except for loads, see below): Instruction Fetch, Instruction Decode, ALU, Writeback. Address calculations (branch addresses or operand addresses) are performed in the ALU.

4. Performance with 40 MHz clocks is 40 million native RPM-40 opcodes per second. For the DOD, they benchmarked on a standard US Air Force mix of instructions called the `DAIS Mix'. "The most pessimistic value on that mix is 14 MIPS", the speaker said.

5. The GE implementation uses a Harvard bus structure, with completely separate Instruction Memory and Operand Memory. GE currently is using a total of 128Kbytes of memory: 16KWords of static RAM, each, for the IMem and OMem. Imem needs 50ns chips and Omem needs 25ns chips. At present there is no way to increase the amount of physical memory (e.g. with dynamic RAM). The speaker said that the CPU chip is intended for "embedded applications".

6. There is a "branch target instruction cache" which consists of 32 entries. Each entry holds 5 instructions (10 bytes). When a branch occurs, the chip looks (fully associatively) to see whether it holds the instruction at the branch target address in its cache. If a hit (target instruction present) occurs, then the branch target instruction, and the next 4 instructions, are read from the on-chip cache.
Meanwhile the off-chip Imem is readying itself to begin delivering the 6th thru Nth instructions after the branch. Claimed hit rates of the branch target instruction cache are > 90%. On a miss there is a 3-cycle latency to get the Imem SRAM chips delivering instructions (and updating the b.t.i. cache).

7. The instruction memory contains a "lookahead counter". This lessens traffic on the address bus; instruction addresses only squirt out of the CPU after a branch .... leisurely reloading the counter while the branch target instruction cache supplies the 5 instructions after a branch.

8. Loads take 7 cycles while ALU operations take 4 cycles. If a program doesn't use the target register of a load until > 3 instructions after the load ("3 load delay slots" in some folks' parlance), then there is no interlock and instructions are issued one per cycle. If you use the target register of a load <= 3 cycles later, there is a pipeline stall while waiting for the Operand Memory to supply the data. Stores "can" operate at "up to" 1 per cycle. GE didn't discuss the constraints that prevent 1 store per cycle always, nor did they compare and contrast loads vs. stores. **

9. Coprocessor instructions (16 bits: 4 bit "Xternal Processor Instruction" opcode plus 12 bit coprocessor instruction type) are passed through the CPU, and sent over the address bus to the coprocessor(s). They can be stored in the branch target address cache. So it _appears_ that two cycles are required to do a coprocessor op, one to communicate it from the CPU to the coprocessor and one to do it **. GE didn't say whether there were architecturally-visible register files on the coprocessors **, but there _appears_ to be an "Xternal Processor Load" instruction **. The Floating Point coprocessor is in fab now and is expected out this month.

10. The CPU chip contains 92,000 transistors and is housed in a 132 pin package. The design style is fully static, which is helpful for achieving radiation-hard parts.
40 pins are inputs, 46 pins are outputs, 32 pins are bidirectional (I/O), and there are 7 Power pins & 7 Ground pins. No mention was made of whether this package configuration had been "certified" to run at 40 MHz, nor what agency would perform such certifications. ** The fab process is 1.2 micron bulk CMOS.

11. A simple virtual memory scheme called "most significant bit replacement" is used. A process-id is appended to the MSB's of an address before sending it out of the CPU. A special case occurs when those bits are all-0's or all-1's.... ** **

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-Mark Johnson *** DISCLAIMER: The opinions above are personal. ***
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark TEL: 408-991-0208
US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
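As a reading aid for the summary chart above, here is a minimal Python sketch that sorts a 16-bit word into the eight formats by its leading bits (00 ALU, 01 COND, 100 LD, 101 ST, 1100 XPLD, 1101 BRA, 1110 PFX, 1111 XPINS). The leading-bit patterns come from the chart; the assumption that they occupy bit 15 downward, and the LD/ST field positions (s in bit 12, then a 4-bit register, a 4-bit base, and the 4-bit offset in the low bits), are my own reading of it rather than a published encoding. `classify` and `decode_load_store` are hypothetical names.

```python
def classify(word: int) -> str:
    """Return the RPM-40 format name for a 16-bit instruction word (sketch)."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("RPM-40 instructions are 16 bits wide")
    top2 = word >> 14  # bits 15..14
    if top2 == 0b00:
        return "ALU"
    if top2 == 0b01:
        return "COND"
    top3 = word >> 13  # bits 15..13
    if top3 == 0b100:
        return "LD"
    if top3 == 0b101:
        return "ST"
    top4 = word >> 12  # bits 15..12; only the 0b11xx patterns remain here
    return {0b1100: "XPLD", 0b1101: "BRA", 0b1110: "PFX", 0b1111: "XPINS"}[top4]


def decode_load_store(word: int) -> dict:
    """Split an LD/ST word into fields.

    ASSUMED layout (not confirmed by GE): s = bit 12, reg = bits 11..8,
    base = bits 7..4, offset = bits 3..0 (the 4-bit offset from the chart).
    """
    kind = classify(word)
    if kind not in ("LD", "ST"):
        raise ValueError(f"not a load/store word: {kind}")
    return {
        "kind": kind,
        "s": (word >> 12) & 0x1,
        "reg": (word >> 8) & 0xF,  # dest for LD, source for ST
        "base": (word >> 4) & 0xF,
        "offset": word & 0xF,
    }
```

Because the formats use 2-, 3- and 4-bit leading patterns that never overlap, testing the shortest patterns first is enough to make the match unambiguous.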
Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC
Author: bcase@apple.UUCP
Date: Mon, 22 Feb 1988 19:21
25 lines
1161 bytes
In article <1642@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>Several articles have recently appeared, alluding to a CMOS uP
>built by General Electric, e.g. <9629@steinmetz.steinmetz.UUCP>,
><9631@steinmetz.steinmetz.UUCP>, and <375@imagine.PAWL.RPI.EDU>.
>
>Branch instructions _seem_ to have only a 12-bit displacement field;
>there doesn't appear to be a "Branch Register", "Branch And Link",
>or "Conditional Branch" instruction. Perhaps the "COND" instruction
>is the conditional-skip instruction recently mentioned on the net**.

Allen Baum (who attended the conference) told me that the single branch instruction is only available in the conditional form. Thus, for an unconditional branch, you must make sure that you know the state of the single boolean bit (compares test a condition and set the state of the boolean bit).

>11. A simple virtual memory scheme called "most significant bit replacement"
>    is used. A process-id is appended to the MSB's of an address before
>    sending it out of the CPU. A special case occurs when those bits
>are all-0's or all-1's.... ** **

Isn't this the original Stanford MIPS scheme?
Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC
Author: jesup@pawl20.paw
Date: Tue, 23 Feb 1988 08:22
166 lines
8514 bytes
In article <1642@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
=These USENET articles mention that the chip, called the "GE RPM-40",
=runs a reduced instruction set, operates from 40 MHz clocks, and will
=be described at ISSCC (International Solid State Circuits Conference)
=on February 17th.
=The most noticeable unknowns are marked with a double asterisk **;
=perhaps others can fill in these gaps (if the data isn't secret).

To my knowledge, everything I say in this article is public information. (I was on the RPM-40 software team for 1 year, until July 87.)

=1. The chip was built under a DOD contract. It is one of several
=   implementations under this contract. There are at least three:
=   General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and
=Texas Instruments (GaAs Bipolar). Interestingly, they have each chosen a
=different pipeline: GE == 4 stages, McDonnell == 5 stages, TI == 6 stages.

Also there's Sperry/UniSys (also CMOS). It's not surprising that the GaAs people use longer pipelines; they can't do much in that time, and are restricted on transistors.

=2. The instruction set is "DARPA MIPS, core ISA (instruction set
=   architecture)". In the GE chip, instructions are 16 bits long.
=   They are fetched from Instruction Memory two-at-a-time (making
=32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.

All the machines listed above are designed so that 'Core ISA' (a generic RISC assembly language, designed by Dr Gross of CMU) can be translated to their native assembly languages.

=The ALU format has two register specifiers; presumably you can code
="R3 := R4 + R3" but you cannot code "R3 := R4 + R1".

Correct, r3 = r4 + r1 becomes r3 = r4; r3 = r3 + r1.

=The Store format has a source register, a base register, and a 4-bit
=offset field. Loads have a dest reg, a base reg, and a 4-bit offset.
=Branch instructions _seem_ to have only a 12-bit displacement field;
=there doesn't appear to be a "Branch Register", "Branch And Link",
=or "Conditional Branch" instruction. Perhaps the "COND" instruction
=is the conditional-skip instruction recently mentioned on the net**.

Any of those displacements can be prefixed by PFX instruction(s) to extend the displacement up to 32 bits. Yes, Cond conditionally skips the next instruction; they can be 'stacked' to provide complex conditionals.

=ALU ops can have a 4-bit immediate field. If this is too small, the
="PREFIX" instruction contains a 12-bit prefix that can be concatenated
=to the immediate, to create a 16-bit immediate value. Perhaps the
=PREFIX instruction can be used with loads, stores, and conditionals
=too. **

Yes, but you can use up to 3 prefixes to get 32 bit constants (in reality, 32 bits are not used very often).

=There are 21 32-bit registers; I _believe_ these are arranged as
=16 general-purpose registers, plus 5 hardware stacks/queues (used in
=exception processing) that are mapped into the register space. **

Minor error: there are 21 gp registers, plus a number of special purpose registers, mostly reserved to supervisor mode. Several are stacks for internal state mapped into register slots. User available registers are the PC, Trap register, sr2 (has various flags), and the Size register (determines the size of non-word LD/ST, allows some register remapping, and a bit for doing 16-bit overflow detection instead of 32).

=8-bit and 16-bit external data are converted into the internal 32-bit
=format by zero-fill (unsigned) or sign-extend (signed). This is to
=fulfill the DOD requirement for byte and halfword support. With only
=a single "s" bit in the opcode it is difficult to see how these
=instructions are encoded (load byte, load halfword, load word) "cross"
=(signed, unsigned). **

There are state bits in the size register that control some of this.
The 's' bit specifies "load word" or "load not word" (type defined by size bits; usually you're only playing with one non-word type).

=4. Performance with 40 MHz clocks is 40 million native RPM-40 opcodes
=   per second. For the DOD, they benchmarked on a standard US Air Force
=   mix of instructions called the `DAIS Mix'. "The most pessimistic
=value on that mix is 14 MIPS", the speaker said.

DAIS is a 1750a (Air Force standard CPU) mix of instructions; the DAIS timings are heavily FPU-dependent and are in 1750a MIPS, not RPM-40 MIPS!

=5. The GE implementation uses a Harvard bus structure, with completely
=   separate Instruction Memory and Operand Memory. GE currently is
=   using a total of 128Kbytes of memory: 16KWords of static RAM, each,
=for the IMem and OMem. Imem needs 50ns chips and Omem needs 25ns chips.
=At present there is no way to increase the amount of physical memory
=(e.g. with dynamic RAM). The speaker said that the CPU chip is intended
=for "embedded applications".

Well.... The current board has 128K, but the CPU supports full 32-bit addressing. Nothing says you can't put more than 128K on it, or use some sort of external cache. The only limit is the amount of capacitance the CPU can drive at 40 MHz.

=8. Loads take 7 cycles while ALU operations take 4 cycles. If a program
=   doesn't use the target register of a load until > 3 instructions after
=   the load ("3 load delay slots" in some folks' parlance), then there
=is no interlock and instructions are issued one per cycle. If you use
=the target register of a load <= 3 cycles later, there is a pipeline stall
=while waiting for the Operand Memory to supply the data.

That is only a software stall, e.g. NOP-insertion. Of course, the reorganizer will try to fill it. Note that the 7 & 4 cycle figures include all pipe stages, including the illusionary IF stage.

=Stores "can" operate at "up to" 1 per cycle.
=GE didn't discuss the
=constraints that prevent 1 store per cycle always, nor did they compare
=and contrast loads vs. stores. **

There are some interlocks with other address-bus-using instructions. You can string as many stores in a row as you want, or as many loads.

=9. Coprocessor instructions (16 bits: 4 bit "Xternal Processor Instruction"
=   opcode plus 12 bit coprocessor instruction type) are passed through
=   the CPU, and sent over the address bus to the coprocessor(s). They
=can be stored in the branch target address cache. So it _appears_ that
=two cycles are required to do a coprocessor op, one to communicate it
=from the CPU to the coprocessor and one to do it **. GE didn't say
=whether there were architecturally-visible register files on the
=coprocessors **, but there _appears_ to be an "Xternal Processor Load"
=instruction **. The Floating Point coprocessor is in fab now and is
=expected out this month.

The CPU doesn't have to wait; it just issues the instruction over the address bus. There is an XPLoad instruction, coprocessor dependent.

=11. A simple virtual memory scheme called "most significant bit replacement"
=    is used. A process-id is appended to the MSB's of an address before
=    sending it out of the CPU. A special case occurs when those bits
=are all-0's or all-1's.... ** **

Tasks can be allocated memory under this scheme in power-of-two sized chunks == 256 bytes. Of course, instructions and data have different mappings.

=++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
=-Mark Johnson *** DISCLAIMER: The opinions above are personal. ***
=UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark TEL: 408-991-0208
=US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

I hate to admit this, but it was decided that Core ISA mandated little-endian memory layout, since several other Core ISA users had implemented their CPUs that way already when we questioned it. (Will little-endianism dog our heels forever?
:-)

VERY rough figures: 1 rpm-40 @ 40MHz is about equal to 7-9 16MHz 68020's with 0 wait-state memory and no MMU delay. (Not your standard unix box environment 68020.)

{ WARNING: this is VERY ROUGH, and though I have calculations available that say this, they are very back-of-napkin style! However, it's probably not TOO far off. Maybe we'll have real performance figures at some point from GE (I don't work there anymore). }

    //  Randell Jesup                         Lunge Software Development
   //   Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
\\//    beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
 \/     (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
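Item 8 and Jesup's reply describe the load rule as a software obligation: a load's target must not be read until more than 3 instructions later, and the tools pad with NOPs when the reorganizer can't fill the slots. Here is a minimal sketch of just the padding step, assuming a toy instruction record; `Insn` and `insert_load_nops` are hypothetical names, and no reordering is attempted.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

LOAD_DELAY_SLOTS = 3  # from the thread: use a load target only "> 3 instructions after"


class Insn(NamedTuple):
    op: str
    dest: Optional[str] = None
    srcs: Tuple[str, ...] = ()


NOP = Insn("NOP")


def insert_load_nops(prog: List[Insn]) -> List[Insn]:
    """Pad with NOPs so no instruction reads a load's target register
    within LOAD_DELAY_SLOTS slots of that load (no reordering)."""
    out: List[Insn] = []
    last_load: Dict[str, int] = {}  # register -> position in `out` of the load writing it
    for insn in prog:
        need = 0
        for reg in insn.srcs:
            if reg in last_load:
                distance = len(out) - last_load[reg]
                need = max(need, LOAD_DELAY_SLOTS + 1 - distance)
        out.extend([NOP] * need)
        if insn.op == "LD" and insn.dest is not None:
            last_load[insn.dest] = len(out)
        out.append(insn)
    return out
```

For example, `LD r1` followed directly by `ADD r3,r1` comes out as `LD, NOP, NOP, NOP, ADD`. The sketch is deliberately conservative: it does not track a non-load that overwrites a pending load target, and a real reorganizer would first try to move independent instructions into those slots.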
Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC
Author: oconnor@sunset.s
Date: Tue, 23 Feb 1988 16:06
39 lines
1739 bytes
An article by bcase@apple.UUCP (Brian Case) says:
] In article <1642@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
] >Several articles have recently appeared, alluding to a CMOS uP
] >built by General Electric, e.g. <9629@steinmetz.steinmetz.UUCP>,
] ><9631@steinmetz.steinmetz.UUCP>, and <375@imagine.PAWL.RPI.EDU>.
] >
] >Branch instructions _seem_ to have only a 12-bit displacement field;
] >there doesn't appear to be a "Branch Register", "Branch And Link",
] >or "Conditional Branch" instruction. Perhaps the "COND" instruction
] >is the conditional-skip instruction recently mentioned on the net**.
]
] Allen Baum (who attended the conference) told me that the single branch
] instruction is only available in the conditional form. Thus, for
] an unconditional branch, you must make sure that you know the state of
] the single boolean bit (compares test a condition and set the state of
] the boolean bit).

Allen Baum has misinterpreted. Branches are conditional-ized just like any other instruction (except PREFIX). If and only if the branch (and its PREFIXes, if any) is preceded by one or more COND instructions (and their PREFIXes, if any) is the branch conditional.

] >11. A simple virtual memory scheme called "most significant bit replacement"
] >    is used. A process-id is appended to the MSB's of an address before
] >    sending it out of the CPU. A special case occurs when those bits
] >are all-0's or all-1's.... ** **
]
] Isn't this the original Stanford MIPS scheme?

It's an enhancement of the original Stanford scheme.
--
Dennis O'Connor oconnor@sunset.steinmetz.UUCP
ARPA: OCONNORDM@ge-crd.arpa
"Nuclear War is NOT the worst thing people can do to this planet."
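Dennis O'Connor's rule (a COND guards the next instruction together with its PREFIXes), combined with Randell Jesup's remark that CONDs can be 'stacked', can be modelled with a tiny interpreter. This is a sketch under my own assumptions, not GE's design: programs are lists of (kind, argument) pairs, COND arguments are Python predicates, PFX is treated as a no-op marker, and `run` and `jesup_example` are hypothetical names. The example program mirrors the COND/PFX sequence Jesup posts in a later follow-up, with made-up values for #$yy and #zz.

```python
from typing import Any, Dict, List, Tuple

State = Dict[str, int]
Program = List[Tuple[str, Any]]


def run(prog: Program, state: State) -> List[str]:
    """Execute prog; return the names of the ordinary instructions that ran."""
    executed: List[str] = []
    pc = 0
    while pc < len(prog):
        kind, arg = prog[pc]
        if kind == "COND":
            if not arg(state):
                # Failed test: skip any PFX/COND words, then the guarded instruction.
                pc += 1
                while pc < len(prog) and prog[pc][0] in ("PFX", "COND"):
                    pc += 1
            pc += 1
            continue
        if kind != "PFX":  # PFX only extends the next immediate; no-op here
            arg(state)
            executed.append(kind)
        pc += 1
    return executed


def jesup_example(yy: int = 5, zz: int = 10) -> Program:
    return [
        ("COND", lambda s: s["r1"] > s["r2"]),         # COND GT,.r1,.r2
        ("PFX", None),                                  # PFX #$xxx
        ("COND", lambda s: s["r1"] >= yy),             # COND GE,.r1,#$yy
        ("PFX", None),                                  # PFX #$qqq
        ("ADD", lambda s: s.update(r1=s["r1"] + zz)),  # ADD .r1,#zz
        ("MOV", lambda s: s.update(r2=s["r1"])),       # MOV .r2,.r1
    ]
```

Starting from r1=7, r2=3 both tests pass and ADD and MOV run; with r1=2 (first test fails) or r1=4 (second test fails) only the MOV runs, matching "if either cond fails, control goes to the MOV instruction".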
Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC
Author: mash@mips.COM (J
Date: Wed, 24 Feb 1988 05:20
37 lines
1819 bytes
In article <409@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
..
>=or "Conditional Branch" instruction. Perhaps the "COND" instruction
>=is the conditional-skip instruction recently mentioned on the net**.
>    Any of those displacements can be prefixed by PFX instruction(s)
>to extend the displacement up to 32 bits. Yes, Cond conditionally skips
>the next instruction; they can be 'stacked' to provide complex conditionals.

I assume that cond skips the next instruction, including the PFX's??

>    Minor error: there are 21 gp registers, plus a number of special
>purpose registers, mostly reserved to supervisor mode. Several are stacks
>for internal state mapped into register slots. User available registers
>are the PC, Trap register, sr2 (has various flags), and the Size register
>(determines the size of non-word LD/ST, allows some register remapping,
>and a bit for doing 16-bit overflow detection instead of 32).

How do you address more than 16 gp regs, given the encoding?

>    VERY rough figures: 1 rpm-40 @ 40MHz is about equal to 7-9
>16MHz 68020's with 0 wait-state memory and no MMU delay. (Not your standard
>unix box environment 68020.)

I.e., assuming that such 68020s are around 2 vax-mips, this sounds like about 14-18 vax-mips, roughly.

>{ WARNING: this is VERY ROUGH, and though I have calculations available that
>  say this, they are very back-of-napkin style! However, it's
>  probably not TOO far off. Maybe we'll have real performance
>  figures at some point from GE (I don't work there anymore). }
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: aglew@ccvaxa.UUC
Date: Wed, 24 Feb 1988 17:23
16 lines
762 bytes
>=2. The instruction set is "DARPA MIPS, core ISA (instruction set
>=   architecture)". In the GE chip, instructions are 16 bits long.
>=   They are fetched from Instruction Memory two-at-a-time (making
>=32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.
>
>    All the machines listed above are designed so that 'Core ISA' (a
>generic RISC assembly language, designed by Dr Gross of CMU) can be translated
>to their native assembly languages.

Okay, what about this MIPS-like ISA? Will it be assembly language only, or binary? Will it be possible to run some form of program intermediate between C and actual assembly through a translator to move between these families - and will third party software vendors distribute that portable form?
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: aglew@ccvaxa.UUC
Date: Fri, 26 Feb 1988 16:52
35 lines
1553 bytes
.> Prefix instructions in the GE RPM-40

I like this idea. (I should - I used it in a school project back in '84, before I knew details of the Transputer - I think I got it from an earlier architecture, melded with the 8088's PREFIX instructions.) I particularly like how it begins to let the instruction set get independent of the register size (so long as people do not expect 1<<32 == 0).

A question, though: how would you compare PREFIX to an instruction SHIFT and OR -- SHOR r,lit ::== r := (r<<14)|lit? PREFIX always seems to eventually require a specification for one of several literal fields it is extending, plus it requires state to be saved on interrupts, which leans towards assembling the constant in a register.

On the other hand, you can always build a decoder that never puts prefix into a register at all, but takes prefix and the prefixed instruction as one packet. This is nice, and makes it a pity to require the register write.

How do people (particularly the RPM-40 people) feel about this?

Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801
aglew@gould.com - preferred, if you have nameserver
aglew@gswd-vms.gould.com - if you don't
aglew@gswd-vms.arpa - if you use DoD hosttable
aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?

My opinions are my own, and are not the opinions of my employer, or any other organisation. I indicate my company only so that the reader may account for any possible bias I may have towards our products.
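To make the comparison concrete, here is a sketch of the hypothetical SHOR, r := (r<<14)|lit, assuming a 14-bit literal and a 32-bit register that simply wraps (the 1<<32 == 0 case mentioned above). It shows that a full 32-bit constant takes three SHORs (4 + 14 + 14 bits), each of which reads the register written by the one before it, which is the dependency the follow-ups argue about. `shor`, `shor_chunks` and `build_with_shor` are illustrative names only.

```python
from typing import List

MASK32 = 0xFFFFFFFF
LIT_BITS = 14                   # from the definition r := (r<<14)|lit
LIT_MASK = (1 << LIT_BITS) - 1


def shor(r: int, lit: int) -> int:
    """One SHOR step on an assumed 32-bit register."""
    if not 0 <= lit <= LIT_MASK:
        raise ValueError("literal does not fit in 14 bits")
    return ((r << LIT_BITS) | lit) & MASK32  # bits shifted past bit 31 are lost


def shor_chunks(value: int) -> List[int]:
    """Split a 32-bit value into SHOR literals, most significant first,
    dropping leading all-zero chunks (at least one literal is always emitted)."""
    value &= MASK32
    chunks = []
    while True:
        chunks.append(value & LIT_MASK)
        value >>= LIT_BITS
        if value == 0:
            break
    return chunks[::-1]


def build_with_shor(value: int) -> int:
    r = 0
    for lit in shor_chunks(value):
        r = shor(r, lit)  # each step depends on the r written by the previous step
    return r
```

A small constant such as 0x1234 needs a single SHOR, while 0xDEADBEEF needs three.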
Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC
Author: jesup@pawl3.pawl
Date: Mon, 29 Feb 1988 09:29
63 lines
2845 bytes
In article <1666@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>In article <409@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
>>    Any of those displacements can be prefixed by PFX instruction(s)
>>to extend the displacement up to 32 bits. Yes, Cond conditionally skips
>>the next instruction; they can be 'stacked' to provide complex conditionals.
>I assume that cond skips the next instruction, including the PFX's??

Yup. The COND instruction actually skips the next (non-PFX, non-COND) instruction. Essentially, it acts as though PFX is part of the instruction after it. Example:

        COND    GT,.r1,.r2
        PFX     #$xxx
        COND    GE,.r1,#$yy
        PFX     #$qqq
        ADD     .r1,#zz
        MOV     .r2,.r1

If either cond fails, control goes to the MOV instruction. Of course, you wouldn't write PFX's in yourself; the assembler does them for you automagically.

>>    Minor error: there are 21 gp registers, plus a number of special
>>purpose registers, mostly reserved to supervisor mode. Several are stacks
>>for internal state mapped into register slots. User available registers
>>are the PC, Trap register, sr2 (has various flags), and the Size register
>>(determines the size of non-word LD/ST, allows some register remapping,
>>and a bit for doing 16-bit overflow detection instead of 32).
>
>How do you address more than 16 gp regs, given the encoding?

In general, the destination of ALU ops can be any register 0-31. However, for most ALU ops the source must be in regs 0-15. There are two ways around this:

1) There are two instructions that reverse the meanings of "source" and "destination". These are RMOV (reverse move) and RADD (reverse add). These allow moving the higher registers to the lower or adding them into the lower (two high-frequency ops).

2) There is a bit that allows swapping of regs 8-13 and regs 16-21.

Note that loads and stores also must only use regs 0-15.
There is no guarantee the higher registers will be extremely useful, but they are very useful for things like temps, or passing args, or accumulators, etc. The swap feature can make them much more useful, but requires more work to use.

>>    VERY rough figures: 1 rpm-40 @ 40MHz is about equal to 7-9
>>16MHz 68020's with 0 wait-state memory and no MMU delay. (Not your standard
>>unix box environment 68020.)
>
>I.e., assuming that such 68020s are around 2 vax-mips, this sounds like
>about 14-18 vax-mips, roughly.

That seems to jibe fairly well. Of course, only real benchmarks will tell the story, and those depend on compiler tech quite a bit.

    //  Randell Jesup                         Lunge Software Development
   //   Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
\\//    beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
 \/     (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
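Jesup's second workaround can be pictured as a renaming step applied to register numbers. This sketch assumes the swap is symmetric and applies to every register specifier; the thread does not say where the bit lives or exactly which specifiers it affects. `physical_reg` and `reachable_sources` are hypothetical names.

```python
from typing import List


def physical_reg(encoded: int, swap: bool) -> int:
    """Map an encoded register number to a physical one (assumed symmetric swap)."""
    if not 0 <= encoded <= 31:
        raise ValueError("register numbers run 0-31")
    if swap:
        if 8 <= encoded <= 13:
            return encoded + 8   # r8..r13 now name r16..r21
        if 16 <= encoded <= 21:
            return encoded - 8   # and r16..r21 name r8..r13
    return encoded


def reachable_sources(swap: bool) -> List[int]:
    """Physical registers a 4-bit source (or load/store) field can reach."""
    return sorted({physical_reg(r, swap) for r in range(16)})
```

With the bit clear, a 4-bit field reaches r0-r15; with it set, the same field reaches r0-r7, r14, r15 and r16-r21, which is how the higher registers become usable as ALU sources or load/store registers.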
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: jesup@pawl3.pawl
Date: Mon, 29 Feb 1988 09:46
27 lines
1380 bytes
In article <28200110@ccvaxa> aglew@ccvaxa.UUCP writes:
>>    All the machines listed above are designed so that 'Core ISA' (a
>>generic RISC assembly language, designed by Dr Gross of CMU) can be translated
>>to their native assembly languages.
>
>Okay, what about this MIPS-like ISA? Will it be assembly language only,
>or binary? Will it be possible to run some form of program intermediate
>between C and actual assembly through a translator to move between these
>families - and will third party software vendors distribute that portable
>form?

Core ISA is an assembly language for a non-existent machine. It is fairly 'RISCy', but includes things like multiply (integer and FP) as single ops, etc. It has no relation to ANY existing hardware at all, and was designed explicitly for the DARPA MIPS project.

Anything distributed in Core ISA is portable (at least potentially). All the machines mentioned have Core_ISA->their_assembler translators. However, I suspect most stuff will be distributed in source (the compilers produce Core ISA; that's the point of it). Assembler modules should all be written in Core as well.

    //  Randell Jesup                         Lunge Software Development
   //   Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
\\//    beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
 \/     (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
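One small piece of such a Core_ISA->RPM-40 translator is lowering a three-address operation into the two-register ALU format, as in Jesup's earlier "r3 = r4 + r1 becomes r3 = r4; r3 = r3 + r1". This is a sketch under stated assumptions: the output syntax is made up, the set of commutative ops is assumed, and `RSUB` is a hypothetical mnemonic for the reverse-subtract Dennis O'Connor mentions in this thread.

```python
from typing import Dict, List

COMMUTATIVE = {"ADD", "AND", "OR", "XOR"}  # assumed op set, for illustration
REVERSED: Dict[str, str] = {"SUB": "RSUB"}  # hypothetical: RSUB d,s means d := s - d


def to_two_address(op: str, d: str, a: str, b: str) -> List[str]:
    """Lower 'd := a op b' into two-address instructions 'OP d, s' (d := d op s)."""
    if d == a:
        return [f"{op} {d}, {b}"]
    if d == b:
        if op in COMMUTATIVE:
            return [f"{op} {d}, {a}"]          # d := d op a, same result
        if op in REVERSED:
            return [f"{REVERSED[op]} {d}, {a}"]  # d := a - d
        raise ValueError(f"{op} {d}, {a}, {b} needs a scratch register")
    # General case: copy the first operand, then operate in place.
    return [f"MOV {d}, {a}", f"{op} {d}, {b}"]
```

The `d == b` check matters: naively emitting `MOV r1, r4` for `r1 := r4 + r1` would overwrite r1 before it is read.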
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: jesup@pawl3.pawl
Date: Mon, 29 Feb 1988 09:58
35 lines
1746 bytes
In article <28200112@ccvaxa> aglew@ccvaxa.UUCP writes:
>..> Prefix instructions in the GE RPM-40
>A question, though: how would you compare PREFIX to an instruction SHIFT and
>OR -- SHOR r,lit ::== r := (r<<14)|lit? PREFIX always seems to eventually
>require a specification for one of several literal fields it is extending,
>plus it requires state to be saved on interrupts, which leans towards
>assembling the constant in a register.

Pipelining! You can't use the result of an op in the next instruction! So you'd have to devote both a register AND intersperse NOPs between SHORs. However, on a machine with loopback of ALU results (may slow things down) it only costs a register, so it doesn't hurt TOO much (if you have registers to spare, which you very well might not).

What are these 'several fields' you refer to? RPM-40 can only have 1 value that might be extended via prefix in any instruction (immediates for ALU and COND ops, offset for load/store/branch, xp instruction field for XPINST, etc.)

> On the other hand, you can always build a decoder that never puts prefix
>into a register at all, but takes prefix and the prefixed instruction as
>one packet. This is nice, and makes it a pity to require the register write.

RPM-40 does that now, but handles each prefix as it comes along (there are some hidden resources being used). What you imply would complicate the decoder a lot.

>Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801

    //  Randell Jesup                         Lunge Software Development
   //   Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
\\//    beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
 \/     (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: oconnor@sunset.s
Date: Mon, 29 Feb 1988 18:04
48 lines
2047 bytes
An article by aglew@ccvaxa.UUCP says:
]
] ..> Prefix instructions in the GE RPM-40
]
] I like this idea.
] [...]
] I particularly like how it begins to let the instruction set get independent
] of the register size (so long as people do not expect 1<<32 == 0)
]
] A question, though: how would you compare PREFIX to an instruction SHIFT and
] OR -- SHOR r,lit ::== r := (r<<14)|lit?

PREFIX builds immediate values that can then be added, ORed, subtracted or whatever to anything you like. It does not use a user register to do this (minor win). And it does NOT access the register file or use the ALU. In a pipelined system this is significant: PREFIX as implemented in RPM40 has no latency problems (major win). SHOR would have latency problems.

] PREFIX always seems to eventually
] require a specification for one of several literal fields it is extending,
] plus it requires state to be saved on interrupts, which leans towards
] assembling the constant in a register.

RPM40 instructions only have one field that can possibly be an immediate operand; why more? Any operations on two constants should be done at compile or load time, I think. Given a reverse-subtract instruction (normal = op1-op2, reverse = op2-op1), I don't see the need for two "immediate-able" fields.

Yes, the prefix register needs to be saved on a context switch, and in fact has to have an old value saved. This is not really a big deal.

] On the other hand, you can always build a decoder that never puts prefix
] into a register at all, but takes prefix and the prefixed instruction as
] one packet. This is nice, and makes it a pity to require the register write.

This is a good idea, especially if you can fetch instructions faster than you can execute (non-PREFIX) instructions.

] Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801
--
Dennis O'Connor    UUNET!steinmetz!sunset!oconnor    ARPA: OCONNORDM@ge-crd.arpa
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
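For comparison, here is a sketch of Glew's hypothetical SHOR instruction, r := (r<<14)|lit, assuming a 32-bit register that starts at zero and an unsigned 14-bit literal. It shows that a full 32-bit constant takes a chain of three SHORs (4 + 14 + 14 bits), each depending on the previous result, which is the dependency O'Connor is pointing at. The helper names are made up for illustration.

```python
REG_BITS = 32
LIT_BITS = 14
REG_MASK = (1 << REG_BITS) - 1
LIT_MASK = (1 << LIT_BITS) - 1


def shor(r: int, lit: int) -> int:
    """SHOR r,lit ::= r := (r << 14) | lit, truncated to a 32-bit register."""
    if not 0 <= lit <= LIT_MASK:
        raise ValueError("literal must fit in 14 bits")
    return ((r << LIT_BITS) | lit) & REG_MASK


def shor_literals(value: int) -> list[int]:
    """Hypothetical helper: literals, most significant first, for a SHOR chain."""
    v = value & REG_MASK
    lits = []
    while True:
        lits.append(v & LIT_MASK)
        v >>= LIT_BITS
        if v == 0:
            break
    return lits[::-1]


def build_with_shor(value: int) -> tuple[int, int]:
    """Assemble `value` starting from r = 0; return (result, number of SHORs).

    Every SHOR reads the r written by the one before it, so without
    bypassing each step would wait for the previous ALU result.
    """
    r = 0
    lits = shor_literals(value)
    for lit in lits:
        r = shor(r, lit)
    return r, len(lits)
```

`build_with_shor(0xDEADBEEF)` returns `(0xDEADBEEF, 3)`, while a small constant such as 5 needs only one SHOR.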
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: mash@mips.COM (J
Date: Tue, 01 Mar 1988 09:02
26 lines
1183 bytes
In article <9727@steinmetz.steinmetz.UUCP> sunset!oconnor@steinmetz.UUCP writes:
..
>] A question, though: how would you compare PREFIX to an instruction SHIFT and
>] OR -- SHOR r,lit ::== r := (r<<14)|lit?
>
>PREFIX builds immediate values that can then be added, ORed,
>subtracted or whatever to anything you like. It does not use
>a user register to do this (minor win). And it does NOT access
>the register file, or use the ALU. In a pipelined system
>this is significant : PREFIX as implemented in RPM40 has no latency
>problems (major win). SHOR would have latency problems.

Why would it have latency problems? None of the popular RISCs have latency problems with r = r op literal for the usual ops. I.e., any high-performance system is likely to make use of register-bypassing anyway, so that:
        r = r op literal
        r = r op r
has zero intervening latency (the performance penalty of a cycle's latency for such things is large).
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
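A rough sketch of Mashey's point, assuming a simplified single-issue, in-order pipeline where an instruction cannot issue until its source registers are readable. The `alu_delay` parameter is a made-up knob: 0 stands for full register bypassing, 1 for an ALU result that the very next instruction cannot use. Real pipelines have more hazards (loads, branches) than this models.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str               # register written
    srcs: tuple[str, ...]   # registers read (literals are not listed)


def issue_cycles(program: list[Instr], alu_delay: int) -> int:
    """Cycles to issue `program` on a simplified in-order, single-issue pipe."""
    readable_at: dict[str, int] = {}  # register -> first cycle it can be read
    cycle = 0
    for ins in program:
        # Stall until every source operand is readable.
        start = max([cycle] + [readable_at.get(s, 0) for s in ins.srcs])
        readable_at[ins.dest] = start + 1 + alu_delay
        cycle = start + 1
    return cycle
```

With `seq = [Instr("r1", ("r1",)), Instr("r1", ("r1", "r2"))]`, i.e. "r = r op literal; r = r op r", `issue_cycles(seq, 0)` is 2 (back to back), while `issue_cycles(seq, 1)` is 3: one lost cycle per such dependent pair.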
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: aglew@ccvaxa.UUC
Date: Tue, 01 Mar 1988 16:23
10 lines
515 bytes
> Ever seen a multiply or divide as 1 instruction in a RISC? No, of
>course they are not there. No direct support on CPU for them either. I will
>say more on this issue when the FPU is formally announced. You can do them
>in the CPU in software if you want, takes a few cycles though.

If your customers spend time doing multiplies or divides, then your RISC designer will put them in. Cray is the only "RISCy" machine that is widely known with multiply that springs to mind, though. Same for floating point.
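To illustrate "you can do them in the CPU in software", here is a classic unsigned shift-and-add multiply, keeping only the low 32 bits as a 32-bit register would. The iteration count is only a rough stand-in for cost: it grows with the position of the multiplier's highest set bit, and on a real RISC each pass would be several instructions.

```python
def soft_multiply(a: int, b: int, width: int = 32) -> tuple[int, int]:
    """Unsigned shift-and-add multiply; returns (low `width` bits of a*b, passes)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    passes = 0
    while b:
        if b & 1:                        # this multiplier bit is set:
            product = (product + a) & mask   # add the shifted multiplicand
        a = (a << 1) & mask              # multiplicand * 2 for the next bit
        b >>= 1                          # move to the next multiplier bit
        passes += 1
    return product, passes
```

`soft_multiply(6, 7)` returns `(42, 3)`: three passes, one per bit of the multiplier 0b111.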
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: davidsen@steinme
Date: Tue, 01 Mar 1988 17:35
22 lines
1009 bytes
In article <444@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
| [...]
| Core ISA is an assembly language for a nonexistent machine. It
| is fairly 'RISCy', but includes things like multiply (integer and FP) as
| single ops, etc. It has no relation to ANY existing hardware at all, and was
| designed explicitly for the DARPA MIPS project.
|
| Anything distributed in Core ISA is portable (at least potentially).
| All the machines mentioned have Core_ISA->their_assembler translators.

If it clarifies the situation, ISA is functionally similar to the old UCSD P-system, and I don't see any technical reason why it couldn't be interpreted instead of translated and compiled. For history buffs, the original "B" language compiler compiled to P-code, which was then used to generate assembler. We had a P-code interpreter on several machines.
--
bill davidsen (wedu@ge-crd.arpa)
{uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
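To make Davidsen's point concrete, here is a tiny interpreter for a made-up three-address instruction set. The opcodes (`li`, `add`, `sub`, `mul`, `bnez`, `halt`) are hypothetical and are not actual Core ISA. The same program list could instead be fed to a translator emitting native assembler; interpreting it simply trades speed for portability.

```python
def run(program: list[tuple], nregs: int = 8) -> list[int]:
    """Interpret a tiny, hypothetical three-address ISA; return final registers.

    ("li", rd, imm)        rd := imm
    ("add"|"sub"|"mul", rd, ra, rb)   rd := ra op rb
    ("bnez", ra, target)   if ra != 0: jump to instruction index `target`
    ("halt",)              stop
    """
    regs = [0] * nregs
    pc = 0
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "li":
            rd, imm = args
            regs[rd] = imm
        elif op in ("add", "sub", "mul"):
            rd, ra, rb = args
            x, y = regs[ra], regs[rb]
            if op == "add":
                regs[rd] = x + y
            elif op == "sub":
                regs[rd] = x - y
            else:
                regs[rd] = x * y   # multiply as a single op, as Core ISA allows
        elif op == "bnez":
            ra, target = args
            if regs[ra] != 0:
                pc = target
        elif op == "halt":
            return regs
        else:
            raise ValueError(f"unknown op {op!r}")
```

For example, this program computes 5! in r0, leaving r1 at 0:

```python
factorial = [
    ("li", 0, 1), ("li", 1, 5), ("li", 2, 1),  # r0 = acc, r1 = n, r2 = 1
    ("mul", 0, 0, 1),                          # 3: acc *= n
    ("sub", 1, 1, 2),                          # 4: n -= 1
    ("bnez", 1, 3),                            # 5: loop while n != 0
    ("halt",),
]
```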
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: tim@amdcad.AMD.C
Date: Tue, 01 Mar 1988 20:18
19 lines
795 bytes
In article <445@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
| Pipelining! You can't use the result of an op in the next
| instruction! So you'd have to devote both a register AND intersperse NOPs
| between SHORs. However, on a machine with loopback of ALU results (may
| slow things down) it only costs a register, so it doesn't hurt TOO much
| (if you have registers to spare, which you very well might not).

Interesting... this is the first RISC processor I have heard of that did not implement operand {forwarding/bypassing/other names?} around the ALU. What prompted the elimination of this feature? Do you have any statistics on how many additional NOPs/stalls are required?

Thanks for any info...
--
Tim Olson
Advanced Micro Devices
(tim@amdcad.amd.com)
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: bron@olympus.SGI
Date: Thu, 03 Mar 1988 21:34
13 lines
640 bytes
In article <28200116@ccvaxa>, aglew@ccvaxa.UUCP writes:
> If your customers spend time doing multiplies or divides, then your RISC
> designer will put them in. Cray is the only "RISCy" machine that is widely
> known with multiply that springs to mind, though. Same for floating point.

FYI, the Cray X-MP machines do NOT have hardware support for a general integer (64-bit) multiply. They can do address-length (24-bit) integer multiplies. They have no hardware for integer divide (of any length). If you need these operations, you have to convert to floating point.
------
Bron Nelson     bron@sgi.com
Don't blame my employers for my opinions.
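A sketch of the "convert to floating point" route for integer divide, using Python's IEEE double as a stand-in for Cray floating point (an assumption: the X-MP's format and precision differ). The floating quotient is only an estimate, so an integer multiply-and-compare step corrects it until 0 <= remainder < divisor. Operands are restricted to non-negative values below 2**53 so they convert to doubles exactly.

```python
def divide_via_float(a: int, b: int) -> tuple[int, int]:
    """Non-negative integer divide from a floating-point quotient estimate."""
    if a < 0 or b <= 0:
        raise ValueError("sketch handles a >= 0, b > 0 only")
    if a >= 1 << 53 or b >= 1 << 53:
        raise ValueError("operands must be exactly representable as doubles")
    q = int(float(a) / float(b))   # estimate: may be off slightly after rounding
    # Fix up with integer multiply and compare.
    while q * b > a:
        q -= 1
    while (q + 1) * b <= a:
        q += 1
    return q, a - q * b
```

`divide_via_float(100, 7)` returns `(14, 2)`, matching Python's `divmod(100, 7)`.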
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: wcs@ho95e.ATT.CO
Date: Fri, 04 Mar 1988 05:22
28 lines
1501 bytes
In article <28200116@ccvaxa> aglew@ccvaxa.UUCP writes:
:
:> Ever seen a multiply or divide as 1 instruction in a RISC? No, of
:>course they are not there. No direct support on CPU for them either. I will
:>say more on this issue when the FPU is formally announced. You can do them
:>in the CPU in software if you want, takes a few cycles though.
:
:If your customers spend time doing multiplies or divides, then your RISC
:designer will put them in. Cray is the only "RISCy" machine that is widely
:known with multiply that springs to mind, though. Same for floating point.

The AT&T Digital Signal Processor chips are RISCy, and do single-instruction multiplies, because that's what the chips' customers do.

The DSP-32 does 32-bit floating point - each cycle does an add and a multiply if you want them, and/or 16-bit integer ops; I think the pipeline is 4 deep for multiplies. The original chip did 4 million cycles/sec (16MHz clock?); the current version does 6 million. The next generation will be faster. The current chip also includes serial and parallel I/O hardware, but only 64K address space; the next will be more general.

The DSP-16 does 16-bit integers (multiplies into 36 bits); it's got very limited memory (1-4K on chip), and has a more limited instruction set, but the 16 - 19 million cycles/sec do a multiply and/or add as well as separate integer ops for address calculation.
--
#                       Thanks;
# Bill Stewart, AT&T Bell Labs 2G218, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wcs
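A minimal sketch of the multiply-accumulate pattern described for the DSP-16, assuming 16-bit signed operands and a 36-bit two's-complement accumulator that simply wraps on overflow. Real hardware may saturate or flag overflow instead; that behaviour is an assumption, not something stated in the post. Each loop pass corresponds to one multiply-and-add.

```python
ACC_BITS = 36


def wrap(value: int, bits: int) -> int:
    """Reduce value modulo 2**bits and read it as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac_dot(xs: list[int], hs: list[int]) -> int:
    """FIR-style dot product: one multiply-accumulate per sample pair."""
    if len(xs) != len(hs):
        raise ValueError("sample and coefficient lists must be the same length")
    acc = 0
    for x, h in zip(xs, hs):
        if not (-32768 <= x <= 32767 and -32768 <= h <= 32767):
            raise ValueError("inputs must be 16-bit signed")
        acc = wrap(acc + x * h, ACC_BITS)  # each product fits in 32 bits
    return acc
```

Sixteen full-scale products (-32768 times -32768 each) sum to 2**34, well inside a 36-bit signed accumulator; thirty-two of them reach 2**35 and, under the wrap assumption, come out as -(2**35).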
Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC
Author: kers@otter.hple.
Date: Fri, 04 Mar 1988 08:47
18 lines
754 bytes
Version 2 of the Acorn RISC Machine has two multiply instructions (one with and one without accumulate), but no divide instruction. At a seminar I attended, the designer* said that (a) they could fit it on the chip, and (b) it afforded enough performance increase to be an acceptable overhead (rather than having a multiply-step, or doing it with shift-and-add).

Mildly surprising, considering the shiftable-register-source in the data manipulation instructions (gives you multiplies by constants of the form 2^n, 2^n + 1, and 2^n - 1 in one instruction). Could it be something to do with having interpreted BBC Basic as a principal language, so there isn't a compiler to notice that E*K can be done speedily?

Regards,
Kers.

* well, one of the designers.
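The one-instruction constant multiplies mentioned above can be sketched on 32-bit unsigned values as follows. The comments name the ARM-style shifted-operand instruction each function mirrors; the Python is only an illustration of the arithmetic, and longer constants come from chaining these (for example x*10 = (x*5)*2).

```python
MASK32 = 0xFFFFFFFF


def times_pow2(x: int, n: int) -> int:
    """x * 2^n, like MOV rd, rx, LSL #n."""
    return (x << n) & MASK32


def times_pow2_plus1(x: int, n: int) -> int:
    """x * (2^n + 1), like ADD rd, rx, rx, LSL #n  (x + (x << n))."""
    return (x + (x << n)) & MASK32


def times_pow2_minus1(x: int, n: int) -> int:
    """x * (2^n - 1), like RSB rd, rx, rx, LSL #n  ((x << n) - x)."""
    return ((x << n) - x) & MASK32
```

So 7*5 is `times_pow2_plus1(7, 2)` (35), 7*7 is `times_pow2_minus1(7, 3)` (49), and 7*10 takes two steps: `times_pow2(times_pow2_plus1(7, 2), 1)` (70).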
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: jesup@pawl23.paw
Date: Sat, 05 Mar 1988 07:52
37 lines
1845 bytes
In article <1729@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>I.e., any high-performance system is likely to make use of
>register-bypassing anyway, so that:
>        r = r op literal
>        r = r op r
>has zero intervening latency (the performance penalty of a
>cycle's latency for such things is large).
>-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Two reasons why one might not have register bypassing:

1) It slows down the critical path. Any finely tuned RISC CPU will most probably have its cycle time determined by the latency through the ALU. Using a loopback of ALU results might (depending on layout, technology, etc.) result in up to a 20% slowdown in the ALU, plus increase the chip area and layout problems. This doesn't mean a loopback is necessarily a loss, but it does have a measurable cost which must be weighed against the benefits.

2) In combination with (1) above, I'm not sure that having a one-cycle delay in ALU results causes any large loss. A good reorganizer can fill those latencies, or move the ALU op forward into, for example, a load delay. In high-speed (> 15 MHz) RISCs (and maybe slower ones as well), load delays are usually the determining factor, or a large part of it. What studies do you have that compare RISCs with 1-cycle ALU delays and 0-cycle ones? I'd like to see anything you can drag up.

3) If one is doing much FP, the CPU is usually waiting on results from the FPU anyway, so you may not lose anything. (I know I said 2, but....)

     //  Randell Jesup                         Lunge Software Development
    //   Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//    beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/     (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
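A toy version of the "good reorganizer" in point 2, assuming a single-issue machine where an ALU result is readable `alu_delay` cycles late. The greedy rule here is an illustrative choice, not RPM-40's actual reorganizer: at each cycle, issue the earliest pending instruction whose operands are ready and that can legally move above the still-pending instructions before it (no read-after-write, write-after-read, or write-after-write conflict); if none qualifies, emit a NOP (shown as `None`).

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    name: str
    dest: str
    srcs: tuple[str, ...]


def can_hoist(k: Instr, earlier: list[Instr]) -> bool:
    """True if k may move above every instruction in `earlier`."""
    for j in earlier:
        if j.dest in k.srcs or k.dest in j.srcs or k.dest == j.dest:
            return False  # RAW, WAR or WAW conflict
    return True


def reorganize(program: list[Instr], alu_delay: int = 1) -> list[Optional[Instr]]:
    """Greedy delay-slot filling; None entries in the result are NOPs."""
    pending = list(program)
    readable_at: dict[str, int] = {}
    schedule: list[Optional[Instr]] = []
    cycle = 0
    while pending:
        choice = None
        for idx, ins in enumerate(pending):
            ready = all(readable_at.get(s, 0) <= cycle for s in ins.srcs)
            if ready and can_hoist(ins, pending[:idx]):
                choice = idx
                break
        if choice is None:
            schedule.append(None)  # nothing legal and ready: pad with a NOP
        else:
            ins = pending.pop(choice)
            readable_at[ins.dest] = cycle + 1 + alu_delay
            schedule.append(ins)
        cycle += 1
    return schedule
```

Given a: r1 = r2+r3, b: r4 = r1+r5 (needs a's result) and c: r6 = r7+r8 (independent), the naive order would need a NOP between a and b; the sketch instead issues a, c, b with no NOPs. With only a and b there is nothing to move, so one NOP remains.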
Re: RPM-40 microprocessor @ 40 MHz; dat
Author: aglew@ccvaxa.UUC
Date: Fri, 11 Mar 1988 16:11
9 lines
252 bytes
>Which is not the standard Unix
>profile (except for things like Crays).

Ayoi! You don't have to buy a multimillion-dollar supercomputer to get a floating-point-oriented system that runs UNIX. Consider Gould (and, to be fair, Alliant, Convex, etc.)