

Multithreading

A multithreaded pipeline is one of the most important features of the Komodo microcontroller. For that reason, this section evaluates the hardware cost of expanding the design to two, three, four or eight threads.

Attempting to implement a multithreaded pipeline runs into a resource problem: the single-threaded variant, which executes the instructions listed in section 4.4.3 except imul and idiv, already needs the whole XC 4036 XL chip. With an equivalent gate count of about 27200 it occupies all 1296 configurable logic blocks (CLBs) of the FPGA. No capacity is left, although some additional "unrelated logic" might be packed into blocks that are already in use. Doing so, however, would make the routing problem worse.

The CLBs of the FPGA are distributed among the functional blocks of the prototype as shown in table 6.2. All of these numbers vary slightly from one synthesis run to the next, and a single configurable logic block can be shared by several functions as long as functionality is not impaired. This may explain the value 0 in column #CLB of row MRU.

Table 6.2: Space needed for the pipeline
Functional block                          #CLB   #FF   #Latches   Equivalent gate count
Interface (BMIU, buffers and inverters)     97    60         66   ca. 1200
IFU                                         60    78          0   ca. 1600
IWDU                                       445   151         40   ca. 6700
EXE                                        506   181          0   ca. 7760
EXEx                                       765   181          0   ca. 15600
WBU                                         27    43          0   ca. 550
SMU                                        154   279          0   ca. 8038
MRU                                          0    51          3   ca. 1214
SU, PMU                                      7    10          0   ca. 84
Total                                     1296   853        109   ca. 27200

Table 6.2 contains two rows for the execution unit. EXE shows the results for an execution unit with the instructions named in section 4.4.3 except imul and idiv, but with the added instructions read_global0, read_global1, write_pc, write_optop, write_global0 and write_global1. EXEx contains all instructions of EXE plus imul.
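The totals of table 6.2 can be cross-checked with a few lines of Python. The following is only an illustrative sketch: the dictionary name `TABLE_6_2` and the helper `column_totals` are made up for this example, and the gate counts are the rounded "ca." values from the table. It assumes that the total row counts the execution unit once, as EXE, which is what the numbers suggest.

```python
# Rows of table 6.2: (CLBs, flip-flops, latches, approx. equivalent gate count).
# Names and layout are illustrative only; values are copied from the table.
TABLE_6_2 = {
    "Interface": (97, 60, 66, 1200),
    "IFU": (60, 78, 0, 1600),
    "IWDU": (445, 151, 40, 6700),
    "EXE": (506, 181, 0, 7760),
    "EXEx": (765, 181, 0, 15600),
    "WBU": (27, 43, 0, 550),
    "SMU": (154, 279, 0, 8038),
    "MRU": (0, 51, 3, 1214),
    "SU, PMU": (7, 10, 0, 84),
}


def column_totals(rows, exclude=()):
    """Sum each column over all rows whose name is not in `exclude`."""
    kept = [values for name, values in rows.items() if name not in exclude]
    return tuple(sum(column) for column in zip(*kept))


# Pipeline without multiplication: count EXE, leave out EXEx.
totals = column_totals(TABLE_6_2, exclude=("EXEx",))
# Pipeline with multiplication: count EXEx, leave out EXE.
totals_x = column_totals(TABLE_6_2, exclude=("EXE",))

print(totals)    # (1296, 853, 109, 27146)
print(totals_x)  # (1555, 853, 109, 34986)
```

The first tuple reproduces the total row (27146 rounds to the quoted ca. 27200). The second reproduces the single-threaded Mappingx figures of 1555 CLBs and roughly 35000 equivalent gates.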

The FPGA size needed for a multithreaded pipeline can be estimated from the per-thread differences in block size:

The single-threaded pipeline without multiplication and division uses 27200 gates. The stack memory consists of 32 memory cells, each 32 bits wide, with one write port and two read ports. Two microcodes are implemented in the microcode ROM unit.

To evaluate the average change in resource needs per thread, a very simple implementation of multithreaded functional blocks without any optimizations is used for the fetch stage, the instruction window & decode unit and the stack memory. The results are shown in table 6.3. The cost of the additional coding bits (thread tag) has been taken into account.

Table 6.3: Average resource need per multithreaded block and thread
Functional block   ΔCLB   ΔFF   ΔLatches   Δ equivalent gate count
IFUm                 22    13          0   ca. 450
IWDUm               145    69         10   ca. 2200
SMUm                245   187          0   ca. 7000

When the multithreaded pipelines are mapped onto the FPGA, the mapping software reports that the chosen device is too small for these circuits, but it still reports the resource needs shown in table 6.4. Following the structure of the execution stage, the table distinguishes between the pipeline without multiplication (Mapping) and with multiplication (Mappingx). Compared with the single-threaded variant, the resource needs of synthesis itself have changed: about 150 MB of RAM are needed. Runtime could not be measured because of the device problem mentioned above.

Table 6.4: Space needed for the multithreaded pipeline
Complete circuit   #Threads   #CLB    #FF   #Latches   Equivalent gate count
Mapping                   1   1296    853        109   ca. 27200
Mappingx                  1   1555    853        109   ca. 35000
Mapping                   2   1539   1107        103   ca. 36100
Mappingx                  2   1963   1107        103   ca. 44600
Mapping                   3   2019   1380        120   ca. 47500
Mappingx                  3   2436   1380        120   ca. 56000
Mapping                   4   2103   1653        137   ca. 54500
Mappingx                  4   2521   1653        137   ca. 63300
Prognosis                 8   4131   2743        190   ca. 86900
Prognosisx                8   4390   2743        190   ca. 94800

The numbers for the eight-threaded microcontroller are a prognosis derived from the per-thread averages and the coding bits of the thread tag.

The results for a pipeline without multiplication are shown in figure 6.1. The equivalent gate count rises almost linearly with the number of threads, but with an additional offset for every new coding bit of the thread tag. The increase over the next smaller number of threads is therefore greater for three threads (about 11400 equivalent gates in table 6.4) than for four (about 7000). The next offset must be added when going to five threads, the following one at nine. By this criterion, the resources are used best when the number of threads is a power of two.
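The relation between thread count and thread-tag width can be made concrete with a short Python sketch. It assumes that the tag is a plain binary thread number, so that n threads need ceil(log2(n)) bits. The text does not state the encoding explicitly, and the names `GATES`, `tag_bits` and `increments` are chosen only for this illustration. The gate counts are the Mapping rows of table 6.4.

```python
# Equivalent gate counts of the pipeline without multiplication (table 6.4).
GATES = {1: 27200, 2: 36100, 3: 47500, 4: 54500}


def tag_bits(threads):
    """Bits needed to number `threads` threads, i.e. ceil(log2(threads)).

    Assumption: the thread tag is a plain binary thread number.
    """
    return (threads - 1).bit_length()


def increments(gates):
    """Gate-count increase of each thread count over the next smaller one."""
    counts = sorted(gates)
    return {n: gates[n] - gates[prev] for prev, n in zip(counts, counts[1:])}


# Thread counts at which one more tag bit becomes necessary.
new_bit_at = [n for n in range(2, 10) if tag_bits(n) > tag_bits(n - 1)]

print(increments(GATES))  # {2: 8900, 3: 11400, 4: 7000}
print(new_bit_at)         # [2, 3, 5, 9]
```

Under this assumption a new tag bit is due at two, three, five and nine threads. That matches both the larger step at three threads and the offsets expected at five and nine.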



Figure 6.1: Equivalent gate count per block and thread


Figure 6.2: Differences in equivalent gate counts per thread

The XC 4036 XL is big enough to implement the single-threaded pipeline designed here in an FPGA without multiplication. For a four-threaded pipeline an XC 4085 XL, with a maximum equivalent gate count of 85000, should be chosen. The remaining roughly 20000 equivalent gates might then be used for hardware scheduling by the priority management unit and the signal unit.
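The headroom on the XC 4085 XL follows directly from table 6.4. The sketch below simply subtracts the four-thread gate counts from the 85000 equivalent gates quoted above. The names `XC4085XL_GATES`, `FOUR_THREAD_GATES` and `headroom` are illustrative, and equivalent gate counts are only a rough capacity measure, so the result should be read as an estimate.

```python
# Maximum equivalent gate count of the XC 4085 XL as quoted in the text.
XC4085XL_GATES = 85000

# Four-thread rows of table 6.4 (approximate equivalent gate counts).
FOUR_THREAD_GATES = {"Mapping": 54500, "Mappingx": 63300}


def headroom(capacity, used):
    """Equivalent gates left on a device for each design variant."""
    return {name: capacity - gates for name, gates in used.items()}


spare = headroom(XC4085XL_GATES, FOUR_THREAD_GATES)
print(spare)  # {'Mapping': 30500, 'Mappingx': 21700}
```

The remainder of about 20000 gates mentioned in the text is closest to the variant with multiplication (21700). Reading it that way is an assumption, since the text does not say which variant is meant.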

When this design is turned into a full microcontroller with, e.g., an analog-to-digital converter and RS232, CAN bus and other interfaces, the additional resource needs of these functional blocks have to be considered.

Given these arguments, the choice could fall on a device from the Xilinx Virtex family that offers up to 200000 equivalent gates (XCV200). The requirements of embedded systems may favor the low-power variant XCV200E. Another point is that the Komodo microcontroller uses a number of memory cells, so the extended memory of the low-power Virtex devices looks interesting for further implementations. These are proposals based on the technical reference (Xilinx, [31]); a complete cost analysis cannot be done here.



Robert Zulauf

2000-04-27