Managing Microarchitectural Timing Violations with Hardware Transactional Memory

Our prior work considered how Hardware Transactional Memory (HTM) could be implemented in a lightweight fashion on embedded systems with coherence-free many-core architectures. HTM was originally proposed for managing memory synchronization in multiprocessor systems: it lets programs speculate about the protection of shared data to improve runtime performance. Now, as a means of managing guardband pessimism (the conservative voltage and timing margins built into a design), we propose a novel HW/SW technique that relies on the same HTM rollback mechanism to correct errors in errant transactions. Our approach replaces traditional conflict detection logic with simpler architectural support for error detection, and it employs error management policies that aggressively apply dynamic voltage scaling (DVS) beyond the point of first failure for better energy savings. The policy monitors transaction aborts and commits to estimate the experienced error rate, then decides whether to lower, maintain, or raise the voltage level.
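The voltage policy described above can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the policy from our papers: the class name `DvsPolicy`, the voltage levels, the abort-rate thresholds, and the window size are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DvsPolicy:
    """Hypothetical abort-rate-driven voltage controller (illustrative only)."""
    levels_mv: tuple = (1000, 950, 900, 850, 800)  # assumed levels, highest first
    low_rate: float = 0.01   # assumed: below this abort rate, lower the voltage
    high_rate: float = 0.05  # assumed: above this abort rate, raise the voltage
    window: int = 100        # assumed: transactions per decision window
    index: int = 0           # position in levels_mv (0 = highest voltage)
    commits: int = 0
    aborts: int = 0

    def record(self, committed: bool) -> int:
        """Record one transaction outcome; return the voltage (mV) to use next."""
        if committed:
            self.commits += 1
        else:
            self.aborts += 1
        total = self.commits + self.aborts
        if total >= self.window:
            rate = self.aborts / total  # estimated error rate for this window
            if rate > self.high_rate and self.index > 0:
                self.index -= 1  # too many rollbacks: raise the voltage
            elif rate < self.low_rate and self.index < len(self.levels_mv) - 1:
                self.index += 1  # few errors: scale the voltage down further
            # otherwise: maintain the current level
            self.commits = self.aborts = 0
        return self.levels_mv[self.index]
```

In this sketch the controller decides once per window of transactions, so a single abort does not immediately raise the voltage; the two thresholds leave a band in which the voltage is held steady.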

This technique extends naturally to using the same HTM framework to manage approximate computing. Approximate computing has emerged as a promising option for applications that can sustain slightly reduced accuracy in exchange for gains in performance and energy efficiency; however, managing this approximation dynamically within an application can be a challenge of its own. If not done correctly, approximation may lead to unacceptable quality loss or, worse, corrupt critical data and disrupt the control flow of the program. Our HTM-inspired framework provides a novel error management scheme that tolerates (i.e., opportunistically ignores) timing violations, allowing for more aggressive voltage scaling. Dynamically deciding which timing violations to ignore relies on careful evaluation of the application running on the system, as well as on an accurate error model that captures error behavior within the processor's computation flow. Our error model takes into account value correlation, computation history, and the critical path of the computation to determine more accurately whether a particular error in space and time is critical. Based on this error analysis, we then use a combination of static and dynamic monitors to determine appropriate conditions for voltage adjustments within tolerable bounds. The key insight is that recovery from critical errors, those that cannot be tolerated, can be facilitated by lightweight mechanisms adapted from HTM, optimizing energy savings while retaining similar runtime performance at an acceptable accuracy loss. Our experimental results show that our approach allows up to 47% total energy savings with negligible impact on runtime.
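The ignore-or-recover decision can be sketched as follows. This is a simplified illustration under stated assumptions, not our actual error model: the function `handle_violation`, its parameters, and the single relative-error estimate are hypothetical stand-ins for the value-correlation, history, and critical-path analysis described above.

```python
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"      # tolerate the violation and keep the approximate result
    ROLLBACK = "rollback"  # abort the transaction and re-execute it


def handle_violation(affects_control_or_address: bool,
                     approximable: bool,
                     est_relative_error: float,
                     error_budget: float) -> Action:
    """Decide how to treat one detected timing violation (hypothetical policy)."""
    # Errors that could corrupt control flow or critical data are never tolerated.
    if affects_control_or_address or not approximable:
        return Action.ROLLBACK
    # Assumed criterion: tolerate only errors estimated to fit within the budget.
    if est_relative_error > error_budget:
        return Action.ROLLBACK
    return Action.IGNORE
```

Here a violation is ignored only when the instruction is marked as approximable and the estimated error fits within the application's budget; every other case falls back to HTM-style rollback.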

We note that our approach requires special circuitry to detect timing errors during program execution, and that our error models rely on assumptions about the hardware design to evaluate the impact of errors on application accuracy. Once the models are determined, assessing the effect of errors on a particular application's accuracy requires profiling the application offline to characterize which instructions are amenable to approximation and the extent to which the application can tolerate errors. Of course, this puts some extra burden on the user, who must understand what can be approximated and which testbenches are appropriately representative of actual application use. On the other hand, the hardware infrastructure we propose does not change with the error model or the results of the error analysis; updates to the error models or testbenches do not require changes to our hardware. This distinguishes our approach from other works that propose special approximate hardware circuits as part of the design.

Here are some of my publications related to the topic.
  • Hardware Transactional Memory Exploration in Coherence-Free Many-Core Architectures. IJPP 2018.
  • Edge-TM: Exploiting Transactional Memory for Error Tolerance and Energy Efficiency. ACM TECS 2017.
  • Evaluating Critical Bits in Arithmetic Operations due to Timing Violations. HPEC 2017.
  • IgnoreTM: Opportunistically Ignoring Timing Violations for Energy Savings using HTM. DATE 2019.