
Re: Universal Processor (was Re: [oc] x86 IP Core)

Couldn't this kind of "translation" be used to optimise the use of a
single CPU?

Some interfaces can be very complex, and the FSM they need could be much
bigger than a small RISC processor and its memory. Would it be possible to
adapt a small CPU (size of the register set, type of instructions, ...)
and generate the binary code at the same time from C code?

As for the compression stuff, I had an idea, just to save a little
bandwidth. Would it be possible to compress L2 cache lines? On each L2
cache miss, the line is loaded and decompressed, but because it's
compressed you could save some memory bus cycles. I imagine the cache
line would have to be too huge for this to be interesting.
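
To make the trade-off concrete, here is a rough sketch of the arithmetic.
Everything in it is an assumption for illustration: a 64-byte line, an
8-byte-wide memory bus, and zlib standing in for whatever hardware
compressor would really be used. The helper names are made up.

```python
import zlib

def bus_beats(num_bytes, bus_width=8):
    """Bus transfers needed to move num_bytes over a bus_width-byte bus."""
    return -(-num_bytes // bus_width)  # ceiling division

def beats_saved(line, bus_width=8):
    """Transfers saved by sending the line compressed (0 if it does not help).

    zlib is only a stand-in here; a real design would use a much simpler
    hardware-friendly scheme.
    """
    raw = bus_beats(len(line), bus_width)
    packed = bus_beats(len(zlib.compress(line)), bus_width)
    return max(raw - packed, 0)

# A line of zeros compresses well; a line with no repeated bytes does not.
zero_line = bytes(64)
varied_line = bytes(range(64))
print(beats_saved(zero_line), beats_saved(varied_line))
```

With a line this small, the fixed overhead of the compressed format eats
most of the gain unless the data is very regular, which is the reason for
the doubt about line size above.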

Nicolas Boulay

>> I have been thinking along those lines for quite some
>> time, but have come to the conclusion that even though
>> this sounds kind of nice it is not particularly practical.
>> You would have a CPU (e.g. OpenRisc) and a translation
>> block. That means the size and power consumption are
>> much higher than those of a native x86 or 68K etc.
>> Additional problems occur in latency of the code morphing
>> block. Think when you take a branch, you would have to
>> stall until the code morphing block can catch up.
>> Several companies have taken this a step further and
>> have removed the traditional instruction decode block
>> from the CPU and are capable of creating custom
>> instruction decode units based on your requirement
>> (x86, 68K etc.)
>> So a CPU that does that might be more interesting than
>> a translation block.
>> Other companies have also written "Just In Time" compilers
>> and translators that convert binaries in software to the
>> desired architecture. But that's obviously very performance
>> degrading unless you are running one very, very long task
>> at a time ...
> As you two already mentioned, a lot of research/work has been done
> on this subject.
> I was trying to use this idea for power saving. The idea is to
> use compressed code and decompress it on the fly into an internal L0
> cache (e.g. 32 locations only). It is also possible to optimize this
> code quite easily, since 'microcode' is very orthogonal and you can
> quickly see which units are used. (I am using 'microcode' here to mean
> the control signals for 1 clock of the CPU pipe).
> It is a bit tricky, however, to keep the addresses of previous
> instructions in the cache here.
> Also it would be really nice if the microcode were a superset of all the
> instructions of all the CPUs you are trying to simulate, e.g. in number
> of registers. Otherwise you have to use JIT.
> If you do decode with cache line refill (L0 cache), you have additional
> latency just with cache misses with mispredicted branches.
> Note that (if L0 is properly sized) this should actually save power.
> best regards,
> Marko
> --
> To unsubscribe from cores mailing list please visit
> http://www.opencores.org/mailinglists.shtml
