Confluence of Data, Computing, and Storage
In traditional information processing systems, different system components are often developed independently of one another. While this separation offers useful abstractions, the inefficiencies it incurs are increasingly costly.
In this research we explore a fundamentally new viewpoint: we design system components (codes and algorithms for storage) in a data-aware way. We are actively exploring several applications of this philosophy, including fault-tolerant memories and context-aware coding for machine learning.
Error-correcting codes and system-level fault-tolerance techniques have historically been developed as separate abstractions in the hardware/software stack. Conventional codes are designed to be agnostic to the content of their data payloads, while fault-tolerance techniques do not leverage knowledge of the underlying code construction to recover from uncorrectable errors. We have created software-defined error-correcting codes (SWD-ECC), a new error-correction technique that co-designs the ECC scheme alongside system-level fault-tolerance mechanisms to enable heuristic recovery from previously uncorrectable errors.
The key idea in SWD-ECC is that side information can be used to heuristically recover from detected-but-uncorrectable errors (DUEs) by estimating the original uncorrupted message. Our techniques allow us to push past the traditional boundaries of ECCs: when an error is detected but uncorrectable, SWD-ECC allows us to decode probabilistically using side information. Although our studies are tailored to memory, the ideas can be applied to storage and communications as well. The approach could benefit computing domains ranging from embedded and mobile devices to the cloud and supercomputing.
Recently, we evaluated our SWD-ECC techniques by heuristically recovering from 2-bit DUEs in MIPS instructions. We performed the offline analysis on SPEC CPU2006 benchmarks using an underlying single-error-correcting double-error-detecting (SEC-DED) code. We were able to recover from 34% of errors that would previously have been unrecoverable and would often have caused catastrophic crashes. The only side information used was the legality and frequency of the instruction bits. Using other side information, such as data correlation, is expected to yield even better results.
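The recovery idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy (8,4) extended Hamming code as the SEC-DED code, rather than the code used in the evaluation, and a hypothetical `frequency` table standing in for the legality and frequency side information. The function names `encode`, `candidates`, and `recover` are also illustrative. When a 2-bit error is detected, the sketch lists every message whose codeword lies at distance 2 from the received word, discards candidates marked illegal, and picks the most frequent of the rest.

```python
def encode(msg: int) -> int:
    """Encode a 4-bit message as an 8-bit extended Hamming (SEC-DED) codeword."""
    d = [(msg >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    bits.append(sum(bits) & 1)  # overall parity bit: total weight becomes even
    return sum(b << i for i, b in enumerate(bits))


# Map every valid codeword back to its message (16 entries for this toy code).
CODEBOOK = {encode(m): m for m in range(16)}


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def candidates(received: int) -> list[int]:
    """Messages whose codewords differ from the received word in exactly 2 bits."""
    return sorted(
        m for cw, m in CODEBOOK.items() if hamming_distance(cw, received) == 2
    )


def recover(received: int, frequency: dict[int, int]):
    """Heuristic recovery from a 2-bit DUE using side information.

    `frequency` is a hypothetical table: messages that are missing or have a
    count of 0 are treated as illegal; among legal candidates the most
    frequent one is returned. Returns None if no candidate is legal.
    """
    legal = [m for m in candidates(received) if frequency.get(m, 0) > 0]
    if not legal:
        return None
    return max(legal, key=lambda m: frequency[m])


if __name__ == "__main__":
    freq = {0b0101: 100, 0b0011: 5}    # assumed side information
    rx = encode(0b0101) ^ 0b00000011   # flip two bits of a valid codeword
    print(candidates(rx), recover(rx, freq))
```

For this toy code every 2-bit error leaves exactly four equidistant candidate messages, so the code alone cannot choose among them; any successful recovery comes entirely from the side information, and a wrong guess remains possible whenever the true message is not the most frequent legal candidate.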
- C. Schoeny, F. Sala, M. Gottscho, I. Alam, P. Gupta, and L. Dolecek, “Context-Aware Resiliency: Unequal Message Protection for Random-Access Memories,” IEEE Transactions on Information Theory, vol. 65(10), pp. 6146–6159, Oct. 2019.
- M. Gottscho, I. Alam, C. Schoeny, L. Dolecek, P. Gupta, “Low-Cost Memory Fault Tolerance for IoT Devices,” ACM Transactions on Embedded Computing Systems, vol. 16(5), pp. 128:1–128:25, Nov. 2017. Best Paper Award at the ACM/IEEE Int. Conference on Compilers, Architecture, and System Synthesis (CASES).
- M. Gottscho, C. Schoeny, L. Dolecek, P. Gupta, “Software-Defined Error-Correcting Codes,” in IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Toulouse, France, Jun. 28-Jul. 1, 2016, (to appear).
- I. Alam, C. Schoeny, L. Dolecek, and P. Gupta, “Parity++: Lightweight Error Correction for Last Level Caches,” Workshop on Silicon Errors in Logic – System Effects (SELSE), March 2018. Best of SELSE; also presented at DSN 2018.
- M. Gottscho, C. Schoeny, L. Dolecek, P. Gupta, “Software-Defined Error-Correcting Codes,” in IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE), Austin, TX, Mar. 29-30, 2016. Best of SELSE (top 3 selected); also presented at DSN 2016.
- C. Schoeny & M. Gottscho, “Software-Defined Error-Correcting Codes,” Qualcomm Innovation Fellowship, San Diego, CA, Mar. 22-23, 2016. Winner of fellowship (top 8 selected).