Brand M (2024)
Publication Type: Thesis
Publication year: 2024
URI: https://open.fau.de/handle/openfau/31038
With the decline of Moore’s law, processor architecture design has shifted away from ever more powerful single-core processors: the stagnating growth of single-core frequency is instead compensated by the parallelism of many- and multi-core systems. With this increasing degree of parallelism, it becomes more and more important to have access to small processing elements that are not only energy efficient but also have a compact instruction set and use the costly instruction memory efficiently. This thesis presents a set of novel processor architectures that are particularly beneficial when used in loop accelerator architectures such as Tightly Coupled Processor Arrays (TCPAs), saving not only hardware but also energy and time under constraints on computational accuracy.
Already the current generation of processor-array-on-chip architectures, e.g., coarse-grained reconfigurable or programmable arrays, includes hundreds or even thousands of processing elements. Thus, it becomes essential to keep the on-chip configuration/instruction memories of each processing element as small as possible. Compilers, too, must take the scarcity of available instruction memory into account and generate code that is as compact as possible. However, compilers for Very Long Instruction Word (VLIW) processors have the well-known problem that they typically produce lengthy code, especially for pipelined instruction sequences. Barely utilized Functional Units (FUs) as well as repeated operations of single FUs inside pipelined instruction sequences may lead to unnecessary or redundant code. Techniques like software pipelining can be used to improve the utilization of the FUs, yet at the risk of code explosion due to the overlapped scheduling of multiple loop iterations or other control flow statements. The Orthogonal Instruction Processing (OIP) processor architecture proposed by Brand et al. shows that the size of pipelined code of compute-intensive loop programs can be reduced significantly compared to the size of an equivalently pipelined VLIW program. The general concept of OIP is that each FU processes its own microprogram orthogonally to the others while sharing control flow and access to the peripheral infrastructure. Contrary to VLIW processors, each FU is equipped with its own instruction memory, branch unit, and program counter. Each FU has access to shared register files and to the flags of all functional units inside the processor. The synchronization of the microprograms may necessitate the repeated execution of single instructions for a fixed number of cycles. This repetition can be encoded in the branch instruction and encapsulated in a single instruction to prevent redundant code. To utilize OIP processors to their full potential, they have to be programmed in a way that minimizes the idle time of FUs (e.g., due to data dependencies) and maximizes throughput. To solve this resource-constrained modulo scheduling problem, techniques based on mixed integer linear programming have been proposed. The hardware extensions OIP requires, however, must not come at the price of a resource overhead, especially in instruction memory, compared to VLIW processors. Therefore, the architecture, in conjunction with a set of benchmark applications, has been analyzed and evaluated regarding program size, memory size, and overall architecture cost (based on a cost model by Müller et al.) in relation to VLIW processors. It could be shown that OIP produces no computational or memory overhead as soon as the program size of the OIP application is at least 50% smaller than that of the VLIW application. This necessary code size reduction is not only achieved for all investigated benchmarks; depending on the application, the code size can even shrink to 4.6% of its VLIW counterpart. Thus, expensive instruction memory can be saved, which reduces the area and power requirements of an OIP processor in comparison to a VLIW processor with an equal number of functional units.
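To make the idea of compact, repeat-encoded microprograms more tangible, the following is a minimal behavioral sketch in C++ rather than the thesis' actual ISA or microarchitecture: every functional unit owns a private instruction memory and program counter, and a repeat count attached to each micro-op stands in for the many identical cycles that unrolled VLIW code would have to spell out (in OIP itself, this count is carried by the branch instruction; here the wrap-around of the program counter plays the role of the loop branch). All names, such as MicroOp and FunctionalUnit, are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// One micro-op of a per-FU microprogram. The repeat field compactly
// encodes "execute this operation for `repeat` consecutive cycles".
struct MicroOp {
    const char* mnemonic;  // operation the FU performs in one cycle
    uint32_t    repeat;    // number of cycles to hold this operation
};

// A functional unit with its own instruction memory and program counter.
// The wrap-around of the program counter stands in for the loop branch.
struct FunctionalUnit {
    std::vector<MicroOp> imem;  // private instruction memory
    std::size_t pc = 0;         // private program counter
    uint32_t remaining = 0;     // cycles left for the current micro-op

    const char* step() {        // advance this FU by one cycle
        if (remaining == 0) {   // current op done: fetch the next one
            remaining = imem[pc].repeat;
            pc = (pc + 1) % imem.size();
        }
        --remaining;
        // Report the op fetched last, i.e., the one at pc - 1 (mod size).
        return imem[(pc + imem.size() - 1) % imem.size()].mnemonic;
    }
};

int main() {
    // Two FUs run orthogonal microprograms; two micro-ops each suffice
    // where unrolled VLIW code would need one instruction word per cycle.
    FunctionalUnit adder{{{"add", 4}, {"nop", 1}}};
    FunctionalUnit mult{{{"mul", 2}, {"nop", 3}}};
    for (int cycle = 0; cycle < 10; ++cycle)
        std::printf("cycle %d: FU0=%s FU1=%s\n",
                    cycle, adder.step(), mult.step());
}
```

Because both toy microprograms have the same period of five cycles, the two FUs stay in lockstep, loosely mirroring the shared control flow of OIP.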
Besides memory requirements, many relevant applications from domains like image processing or machine learning are compute-intensive. However, they do not always rely on perfectly accurate results: even though reasonably inaccurate computations in Convolutional Neural Networks (CNNs) may influence the exact probability of each class, they rarely change the final decision. With this motivation in mind, Anytime Instruction Processing (AIP), a concept for programmable-accuracy floating-point computations, has been defined and investigated. AIP gives a programmer or compiler control over the accuracy of computed floating-point (FP) operations. The intended accuracy is encoded at bit granularity into the instruction’s opcode and defines how many most significant mantissa bits (MSBs) of the FP operation shall be computed; the executed operation computes only that many bits and may even terminate earlier than a full-accuracy computation. Thus, only errors in the resulting mantissa are to be expected, while the resulting exponent and sign are always computed accurately.
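As a rough illustration of these semantics, the sketch below is a behavioral model in C++, not the hardware datapath (which terminates early instead of computing the full result and truncating afterwards): it keeps only the a most significant mantissa bits of a binary32 product, counting the hidden leading one as the first bit, while sign and exponent stay exact. The helper name anytime_mul and the truncate-after-the-fact modeling are assumptions made for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Behavioral model of an anytime multiplication on binary32 values:
// keep only the `a` most significant mantissa bits (hidden bit included)
// of the exactly computed product. Subnormals and NaNs are ignored for
// brevity; real hardware would terminate the computation early instead.
float anytime_mul(float x, float y, unsigned a) {
    float exact = x * y;          // exact binary32 product as reference
    if (a >= 24) return exact;    // full single-precision accuracy
    if (a == 0) a = 1;            // the hidden leading 1 is always kept
    uint32_t bits;
    std::memcpy(&bits, &exact, sizeof bits);
    // binary32 layout: 1 sign bit, 8 exponent bits, 23 explicit mantissa
    // bits. Clear the low (24 - a) explicit mantissa bits; sign and
    // exponent stay untouched, so only the mantissa carries an error.
    bits &= ~((1u << (24 - a)) - 1u);
    std::memcpy(&exact, &bits, sizeof exact);
    return exact;
}

int main() {
    for (unsigned a : {24u, 12u, 6u, 1u})
        std::printf("a = %2u : %.7f\n", a, anytime_mul(3.14159f, 2.71828f, a));
}
```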
An anytime division capitalizes on the fact that divisions are classically computed MSB first anyway; it is implemented as a non-restoring division that can terminate early based on the instructed accuracy. For the typically non-MSB-first operations of addition and multiplication, two implementations have been presented: one uses on-line arithmetic to compute the addition or multiplication of the mantissa MSB first, the other uses a bitmasking scheme to mask the least significant bits that should not be computed. In on-line arithmetic, a recurrence equation is derived for an operation in which the dependencies between two consecutive result bits are removed (e.g., the carry chain of an addition). Without those dependencies, each result bit can be computed independently of all others, enabling MSB-first computation. By computing MSB first, on-line arithmetic also provides a high potential for pipelining: the execution of consecutive on-line instructions can start as soon as the first digit of the preceding instruction has been computed. Furthermore, operating in a redundant number format helps to reduce the complexity of the recurrence equation and the number of required iterations. Thus, binary operations in on-line arithmetic are usually performed in the redundant Signed-Digit Radix-2 (SDR2) number format. Based on SDR2, the redundant FP number format SDFP has been defined, which enables the use of on-line arithmetic in anytime instructions. The alternative implementation of anytime additions and multiplications, the bitmasking approach, is based on masking those operand bits that do not contribute to the specified a MSBs of the result mantissa. In the case of a multiplication, this masking is applied to the partial products instead of directly to the operands in order to reduce the possible error. Compared to on-line arithmetic, computations may exhibit a higher error, but the hardware overhead is negligible.
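Of the two variants, the bitmasking scheme is the simpler one to sketch. The toy C++ model below (bit widths, names, and the exact cut-off rule are illustrative, not the thesis' precise masking scheme) multiplies two 8-bit mantissas but masks every partial-product bit below result position 2w - a, so bits that cannot contribute to the requested a MSBs are never accumulated.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative bitmasking multiply: accumulate shifted partial products,
// but clear every partial-product bit below the cut-off position 2w - a,
// i.e., below the requested a MSBs of the 2w-bit product.
uint32_t masked_mul(uint16_t x, uint16_t y, unsigned w, unsigned a) {
    const unsigned cutoff = 2 * w - a;   // lowest result bit still kept
    uint32_t acc = 0;
    for (unsigned i = 0; i < w; ++i) {
        if (!((y >> i) & 1u)) continue;                // multiplier bit 0
        uint32_t pp = static_cast<uint32_t>(x) << i;   // partial product
        if (cutoff > 0) pp &= ~((1u << cutoff) - 1u);  // mask low bits
        acc += pp;
    }
    return acc;  // approximates the a MSBs; lower bits may be wrong
}

int main() {
    const uint16_t x = 0xB5, y = 0xC9;   // two example 8-bit mantissas
    const uint32_t exact = static_cast<uint32_t>(x) * y;
    for (unsigned a : {16u, 8u, 4u})
        std::printf("a = %2u : exact = 0x%04X, masked = 0x%04X\n",
                    a, exact, masked_mul(x, y, 8, a));
}
```

Note that carries dropped from the masked positions can still perturb the kept bits, which is consistent with the higher error of the bitmasking variant compared to on-line arithmetic described above.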
After integrating the anytime instruction paradigm into Aarith, a C++ arbitrary-precision framework, the framework was used to evaluate the error behavior of anytime addition, multiplication, and division. Power and area were evaluated using synthesis and simulation tools. The experiments clearly favor computing iterative applications with anytime instructions. It is shown that an iterative Jacobi solver can save on average up to 39% of energy at a computational error below 0.1% when compared to single-precision FP computations. Further, the applicability of anytime instructions to CNNs has been investigated specifically by setting an individual accuracy per layer for the inference of a ResNet-18 CNN implementation. A design space exploration showed that, by using AIP, the energy of the inference can be reduced by up to 62%, again compared to single-precision FP computations.
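To give a flavor of why iterative applications are a good fit, here is a hypothetical, self-contained C++ toy; the thesis' evaluation was carried out with Aarith, and the 3x3 system and accuracy schedule below are invented for illustration. Early Jacobi sweeps run with few mantissa bits, and later sweeps at full accuracy refine the result, so the early error does not persist in the converged solution.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Behavioral stand-in for an anytime result: keep only the `a` most
// significant mantissa bits (hidden bit included) of a binary32 value.
float trunc_mantissa(float v, unsigned a) {
    if (a >= 24) return v;
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    bits &= ~((1u << (24 - a)) - 1u);
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

int main() {
    // Diagonally dominant system A*x = b, so Jacobi iteration converges.
    const float A[3][3] = {{4, -1, 0}, {-1, 4, -1}, {0, -1, 4}};
    const float b[3]    = {15, 10, 10};
    float x[3] = {0, 0, 0};

    for (int it = 0; it < 30; ++it) {
        // Invented accuracy schedule: 8 mantissa bits early, then full.
        const unsigned a = (it < 10) ? 8 : 24;
        float xn[3];
        for (int i = 0; i < 3; ++i) {
            float s = b[i];
            for (int j = 0; j < 3; ++j)
                if (j != i) s -= trunc_mantissa(A[i][j] * x[j], a);
            xn[i] = trunc_mantissa(s / A[i][i], a);
        }
        std::memcpy(x, xn, sizeof x);
    }
    // Converges to roughly (4.910714, 4.642857, 3.660714).
    std::printf("x = (%.6f, %.6f, %.6f)\n", x[0], x[1], x[2]);
}
```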
Besides the programmable accuracy provided by anytime instructions, reconfigurable-precision floating-point functional units have been investigated that provide not only dynamically selectable floating-point formats but also vectorization and sub-word parallelism to potentially increase the performance of applications even further.
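Sub-word parallelism can be illustrated with a classic SWAR (SIMD within a register) trick. The integer sketch below is only an analogy: the thesis' reconfigurable-precision FUs operate on dynamically selectable floating-point formats, where splitting a datapath into lanes is considerably more involved. Here, a single 32-bit addition processes two independent 16-bit lanes by keeping carries from crossing the lane boundary.

```cpp
#include <cstdint>
#include <cstdio>

// Add two independent 16-bit lanes packed into 32-bit words using one
// 32-bit addition. Masking the lane MSBs first prevents a carry from
// spilling into the upper lane; the MSBs are then restored via XOR.
uint32_t add16x2(uint32_t a, uint32_t b) {
    uint32_t sum  = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    uint32_t msbs = (a ^ b) & 0x80008000u;
    return sum ^ msbs;
}

int main() {
    const uint32_t a = (1000u << 16) | 2000u;   // lanes {1000, 2000}
    const uint32_t b = (3000u << 16) | 4000u;   // lanes {3000, 4000}
    const uint32_t r = add16x2(a, b);
    std::printf("hi lane = %u, lo lane = %u\n",  // prints 4000, 6000
                r >> 16, r & 0xFFFFu);
}
```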
In summary, the concepts presented in this thesis, implemented in the novel OIP processor architecture, make it possible to save not only hardware area and instruction memory but also, to a great extent, energy and time when executing loop applications under constraints on computational accuracy, in specific circumstances even without reducing the result accuracy.
APA:
Brand, M. (2024). Approximate and Reconfigurable Precision Instruction Set Processors for Tightly Coupled Processor Arrays (Dissertation).
MLA:
Brand, Marcel. Approximate and Reconfigurable Precision Instruction Set Processors for Tightly Coupled Processor Arrays. Dissertation, 2024.