Stochastic Computing Architecture for Efficient Use of TinyML
January 11, 2022
Neural networks are a popular machine learning model, but they demand high energy consumption and complex hardware design. Stochastic computing is an efficient way to balance the trade-off between hardware efficiency and computing performance. However, stochastic computing suffers from low accuracy on ML workloads because of the arithmetic units' low data precision and inaccuracy.
To solve the problems of traditional stochastic computing and improve performance through higher accuracy and lower power consumption, ongoing research proposes a modified block-based stochastic computing architecture. Introducing blocks in the input layer reduces latency by exploiting high data parallelism. Determining the number of blocks is a key design decision, which the global optimization approach takes care of.
Existing methods improve data precision by increasing the length of the bitstreams, in some cases using exponentially long bitstreams to obtain accurate results. However, this introduces long computing latency, which is impractical for TinyML applications. To deal with this latency, the bitstreams are divided into blocks that are executed in parallel. Incorporating intra-block arithmetic units and the Output Revision (OUR) scheme mitigates the inter-block inaccuracy problems to provide high computing efficiency.
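To make the precision and latency trade-off concrete, here is a minimal Python sketch of unipolar stochastic encoding, where a value in [0, 1] is represented by the fraction of 1s in a random bitstream. The function names, the seed, and the choice of 8 blocks are illustrative assumptions, not details taken from the paper.

```python
import random

def encode_unipolar(value, length, rng):
    """Encode a value in [0, 1] as a bitstream whose fraction of 1s approximates it."""
    return [1 if rng.random() < value else 0 for _ in range(length)]

def decode_unipolar(bits):
    """Recover the value as the fraction of 1s in the stream."""
    return sum(bits) / len(bits)

def split_into_blocks(bits, k):
    """Split a stream into k equal-length blocks (length must be divisible by k)."""
    if len(bits) % k != 0:
        raise ValueError("stream length must be divisible by k")
    size = len(bits) // k
    return [bits[i * size:(i + 1) * size] for i in range(k)]

rng = random.Random(0)  # fixed seed so the run is repeatable
value = 0.3

# Longer streams typically give a smaller encoding error, at the cost of more cycles.
for length in (16, 256, 4096):
    stream = encode_unipolar(value, length, rng)
    error = abs(decode_unipolar(stream) - value)
    print(f"length={length:5d}  error={error:.4f}")

# Serial processing needs one cycle per bit; k parallel blocks need length / k cycles.
stream = encode_unipolar(value, 4096, rng)
blocks = split_into_blocks(stream, k=8)
print("serial cycles:", len(stream), "block-parallel cycles:", len(blocks[0]))
```

Splitting the stream does not change the total number of 1s, so the encoded value is preserved while the number of sequential cycles drops by a factor of k.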
Block-Based Stochastic Computing Architecture
The research presents a novel architecture in which the inputs are divided into blocks, and multiplication and addition are executed in parallel using optimized intra-block arithmetic units. Furthermore, the proposed model is claimed to be unmatched in its latency-power trade-off for TinyML applications.
[Image Credit: Research Paper]
The architecture is divided as follows:
As shown in the figure above, the input bitstream is divided into ‘k’ blocks. The proposed strategy selects a good number of blocks, which is not guaranteed to be optimal but serves as a close approximation. A poor choice of the number of blocks, on the other hand, can incur large errors. The selection involves a complex computation to determine the probability of two average values, for the positive and negative parts of the input bitstreams. See the research paper, "BSC: Block-based Stochastic Computing to Enable Accurate and Efficient TinyML", for more details on the heuristic strategy for block division.
Traditional adders suffer from correlation problems in the OR adder and overflow problems in the separated adder. To mitigate these problems, the modified architecture places XNOR and AND gates between the inputs to eliminate the correlation of bipolar computing. The diagram below illustrates intra-block computation.
[Image Credit: Research Paper]
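One common way to read the XNOR and AND gates in a sign-magnitude setting is sketched below: the sign bits are combined with XNOR and the magnitude bitstreams with AND. This is an illustrative Python model, not the paper's circuit. The encoding convention (sign bit 1 for non-negative values), the function names, and the stream length are all assumptions made for the example.

```python
import random

def encode_sign_magnitude(value, length, rng):
    """Encode a value in [-1, 1] as (sign_bit, magnitude_stream).

    Assumption for this sketch: sign_bit is 1 for non-negative values, 0 for negative.
    """
    sign_bit = 1 if value >= 0 else 0
    magnitude = abs(value)
    stream = [1 if rng.random() < magnitude else 0 for _ in range(length)]
    return sign_bit, stream

def decode_sign_magnitude(sign_bit, stream):
    """Fraction of 1s gives the magnitude; the sign bit restores the sign."""
    magnitude = sum(stream) / len(stream)
    return magnitude if sign_bit == 1 else -magnitude

def multiply(a, b):
    """Multiply two sign-magnitude stochastic numbers."""
    sign_a, stream_a = a
    sign_b, stream_b = b
    sign_out = 1 - (sign_a ^ sign_b)                          # XNOR of the sign bits
    stream_out = [x & y for x, y in zip(stream_a, stream_b)]  # AND of the magnitude bits
    return sign_out, stream_out

rng = random.Random(42)  # fixed seed so the run is repeatable
a = encode_sign_magnitude(0.5, 8192, rng)
b = encode_sign_magnitude(-0.6, 8192, rng)
product = decode_sign_magnitude(*multiply(a, b))
print(f"estimated product: {product:.3f} (exact: {0.5 * -0.6:.3f})")
```

The AND of two independent streams has a 1 with probability equal to the product of the two input probabilities, which is why the magnitude comes out close to 0.3 here. Because each bit position is handled independently, the operation works the same way inside a block as over the full stream.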
Each input bit is fed into the parallel counter (PC), and the positive and negative parts (Ap, An) are processed separately. Two dedicated accumulators handle the signed bits. After the input bits are taken in, the accumulators are subtracted from each other, as illustrated by the positive and negative parts. The goal is to obtain the number of accumulated 1s across all inputs. One bit of each temporal output (Sop, Son) is then produced by comparison, and after ‘n’ cycles, the sign bit is computed and the output result is selected from Sop and Son based on the signed bits Ap and An.
This new accumulator-based adder for sign-magnitude format takes advantage of the uNSADD adder to compare the real accumulated 1s in outputs as well as inputs to determine the output bit. This method eliminates the impact of correlation and quick overflow issues.
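The difference between an OR adder and an accumulator-based adder can be sketched in a few lines of Python. This is a simplified software model under stated assumptions, not the paper's hardware: `accumulator_add` follows the general idea of comparing accumulated 1s in the inputs against 1s already emitted, and `signed_totals` is a hypothetical helper that only shows how separate positive (Ap) and negative (An) counts yield a sign and a net magnitude.

```python
def or_add(streams):
    """OR-gate adder: accurate only when 1s in the inputs rarely coincide."""
    return [1 if any(bits) else 0 for bits in zip(*streams)]

def accumulator_add(streams):
    """Non-scaled adder that tracks the real count of 1s in inputs and output.

    Each cycle a parallel counter adds the input 1s to acc_in; a 1 is emitted
    whenever acc_in is still ahead of the number of 1s already emitted.
    """
    acc_in = 0
    acc_out = 0
    output = []
    for bits in zip(*streams):
        acc_in += sum(bits)  # parallel counter over this cycle's input bits
        bit = 1 if acc_in > acc_out else 0
        acc_out += bit
        output.append(bit)
    return output

def signed_totals(streams, signs):
    """Accumulate 1s separately for positive (Ap) and negative (An) inputs.

    signs[i] is 1 for a positive input and 0 for a negative one (assumed convention).
    Returns (sign_bit, net_count), where net_count = |Ap - An|.
    """
    ap = sum(sum(s) for s, sign in zip(streams, signs) if sign == 1)
    an = sum(sum(s) for s, sign in zip(streams, signs) if sign == 0)
    return (1 if ap >= an else 0), abs(ap - an)

a = [1, 0, 1, 0, 0, 0, 0, 0]  # value 2/8
b = [1, 0, 1, 0, 0, 0, 0, 0]  # identical to a, so fully correlated

print(sum(or_add([a, b])) / len(a))           # 0.25: the correlated 1s collapse
print(sum(accumulator_add([a, b])) / len(a))  # 0.5: the expected sum
print(signed_totals([a, b], signs=[1, 0]))    # (1, 0): +0.25 and -0.25 cancel
```

With fully correlated inputs the OR adder loses half of the 1s, while the accumulator version spreads the surplus over later cycles and recovers the expected count.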
Inter-Block Output Revision Scheme
Even though the intra-block adder solves the correlation and overflow problems, the block division introduces a new inter-block inaccuracy error. This does not happen for multipliers, since the inputs are XNORed and ANDed. For the adder, however, the number of 1s in the output can deviate from the expected count, resulting in inaccuracy. To solve these inter-block inaccuracy errors, the Output Revision scheme adds or removes 1s after the parallel intra-block computation stage without introducing any additional latency.
The novel block-based stochastic computing architecture aims to improve the accuracy of the stochastic computing arithmetic circuit while reducing computing latency and energy consumption. According to the findings, the method achieves over 10% higher accuracy than existing methods and reduces power by more than 6x.
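The article does not spell out how the revision is implemented, so the following Python sketch is only a loose illustration of the idea. It assumes a simple accumulator adder per block, treats the count left over at the end of a block as the quantity to correct, and uses a hypothetical `revise` helper that flips bits until the count of 1s is adjusted by the requested amount.

```python
def accumulator_add(streams):
    """Emit a 1 whenever the accumulated input 1s exceed the 1s already emitted."""
    acc_in = acc_out = 0
    output = []
    for bits in zip(*streams):
        acc_in += sum(bits)
        bit = 1 if acc_in > acc_out else 0
        acc_out += bit
        output.append(bit)
    return output, acc_in - acc_out  # output bits and the residual left at the end

def revise(bits, delta):
    """Add (delta > 0) or remove (delta < 0) up to |delta| ones in a block output."""
    revised = list(bits)
    target, replacement = (0, 1) if delta > 0 else (1, 0)
    remaining = abs(delta)
    for i, bit in enumerate(revised):
        if remaining == 0:
            break
        if bit == target:
            revised[i] = replacement
            remaining -= 1
    return revised

def blockwise_add(streams, block_size, apply_revision):
    """Add each block independently, optionally revising its output afterwards."""
    output = []
    for start in range(0, len(streams[0]), block_size):
        block = [s[start:start + block_size] for s in streams]
        bits, residual = accumulator_add(block)
        if apply_revision:
            bits = revise(bits, residual)
        output.extend(bits)
    return output

a = [0, 0, 0, 1, 1, 0, 0, 0]
b = [0, 0, 0, 1, 0, 0, 0, 0]

print(sum(accumulator_add([a, b])[0]))                      # 3: unblocked reference
print(sum(blockwise_add([a, b], 4, apply_revision=False)))  # 2: one 1 lost at the block boundary
print(sum(blockwise_add([a, b], 4, apply_revision=True)))   # 3: revision restores it
```

In this toy model the first block ends with one input 1 that was counted but never emitted, and the second block starts from a fresh accumulator, so the total falls short. Revising the first block's output after its computation restores the count. In this sketch the residual is never negative, so the removal path in `revise` is included only to mirror the "adds or removes" wording.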