Pub Date : 2020-01-01DOI: 10.1109/ASP-DAC47756.2020.9045555
Fan Chen, Linghao Song, Hai Li, Yiran Chen
Technological advances in long read sequences have greatly facilitated the development of genomics. However, managing and analyzing the raw genomic data that outpaces Moore’s Law requires extremely high computational efficiency. On the one hand, existing software solutions can take hundreds of CPU hours to complete human genome alignment. On the other hand, the recently proposed hardware platforms achieve low processing throughput with significant overhead. In this paper, we propose PARC, an Processing-in-Memory architecture for long read pairwise alignment leveraging emerging resistive CAM (content-addressable memory) to accelerate the bottleneck chaining step in DNA alignment. Chaining takes 2-tuple anchors as inputs and identifies a set of correlated anchors as potential alignment candidates. Unlike traditional main memory which organizes relational data structure in a linear address space, PARC stores tuples in two neighboring crossbar arrays with shared row decoder such that column-wise in-memory computational operations and row-wise memory accesses can be performed in-situ in a symmetric crossbar structure. Compared to both software tools and state-of-the-art accelerators, PARC shows significant improvement in alignment throughput and energy efficiency, thanks to the in-site computation capability and optimized data mapping.
{"title":"PARC: A Processing-in-CAM Architecture for Genomic Long Read Pairwise Alignment using ReRAM","authors":"Fan Chen, Linghao Song, Hai Li, Yiran Chen","doi":"10.1109/ASP-DAC47756.2020.9045555","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045555","url":null,"abstract":"Technological advances in long read sequences have greatly facilitated the development of genomics. However, managing and analyzing the raw genomic data that outpaces Moore’s Law requires extremely high computational efficiency. On the one hand, existing software solutions can take hundreds of CPU hours to complete human genome alignment. On the other hand, the recently proposed hardware platforms achieve low processing throughput with significant overhead. In this paper, we propose PARC, an Processing-in-Memory architecture for long read pairwise alignment leveraging emerging resistive CAM (content-addressable memory) to accelerate the bottleneck chaining step in DNA alignment. Chaining takes 2-tuple anchors as inputs and identifies a set of correlated anchors as potential alignment candidates. Unlike traditional main memory which organizes relational data structure in a linear address space, PARC stores tuples in two neighboring crossbar arrays with shared row decoder such that column-wise in-memory computational operations and row-wise memory accesses can be performed in-situ in a symmetric crossbar structure. Compared to both software tools and state-of-the-art accelerators, PARC shows significant improvement in alignment throughput and energy efficiency, thanks to the in-site computation capability and optimized data mapping.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127449712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-01-01DOI: 10.1109/ASP-DAC47756.2020.9045288
Yandong Luo, Shimeng Yu
Compute-in-memory (CIM) with emerging non-volatile memories (eNVMs) is time and energy efficient for deep neural network (DNN) inference. However, challenges still remain for in-situ DNN training with eNVMs due to the asymmetric weight update behavior, high programming latency and energy consumption. To overcome these challenges, a hybrid precision synapse combining eNVMs with capacitor has been proposed. It leverages the symmetric and fast weight update in the volatile capacitor, as well as the non-volatility and large dynamic range of the eNVMs. In this paper, in-situ DNN training architecture with hybrid precision synapses is proposed and benchmarked with the modified NeuroSim simulator. First, all the circuit modules required for in-situ training with hybrid precision synapses are designed. Then, the impact of weight transfer interval and limited capacitor retention time on training accuracy is investigated by incorporating hardware properties into Tensorflow simulation. Finally, a system-level benchmark is conducted for hybrid precision synapse compared with baseline design that is solely based on eNVMs.
{"title":"Benchmark Non-volatile and Volatile Memory Based Hybrid Precision Synapses for In-situ Deep Neural Network Training","authors":"Yandong Luo, Shimeng Yu","doi":"10.1109/ASP-DAC47756.2020.9045288","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045288","url":null,"abstract":"Compute-in-memory (CIM) with emerging non-volatile memories (eNVMs) is time and energy efficient for deep neural network (DNN) inference. However, challenges still remain for in-situ DNN training with eNVMs due to the asymmetric weight update behavior, high programming latency and energy consumption. To overcome these challenges, a hybrid precision synapse combining eNVMs with capacitor has been proposed. It leverages the symmetric and fast weight update in the volatile capacitor, as well as the non-volatility and large dynamic range of the eNVMs. In this paper, in-situ DNN training architecture with hybrid precision synapses is proposed and benchmarked with the modified NeuroSim simulator. First, all the circuit modules required for in-situ training with hybrid precision synapses are designed. Then, the impact of weight transfer interval and limited capacitor retention time on training accuracy is investigated by incorporating hardware properties into Tensorflow simulation. Finally, a system-level benchmark is conducted for hybrid precision synapse compared with baseline design that is solely based on eNVMs.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133327059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-01-01DOI: 10.1109/ASP-DAC47756.2020.9045410
M. Liao, J. Sampson
Recent works have shown that certain classes of emerging memory technologies lend themselves to organizations that offer equally dense access support for patterns with multiple strides, such as row-column memories. However, with few exceptions, these prior works have only considered such multi-orientation memories (MOMs) and MOM-caching techniques in the context of traditional processor architectures. In this work, we explore the potential for leveraging the capabilities of MOMs to present multiple concurrent views of data organization within the memory hierarchy as a means to offload and overlap inter-kernel marshalling, a range of data layout transformations, and even lazy construction of derivative data structures to work performed by the MOM-capable memories and caches themselves. We demonstrate the potential of MOM-offloading to improve performance and reduce data movement for select computation patterns and describe the application of the approach to broader classes of processing in memory workloads.
{"title":"Emerging memories as enablers for in-memory layout transformation acceleration and virtualization","authors":"M. Liao, J. Sampson","doi":"10.1109/ASP-DAC47756.2020.9045410","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045410","url":null,"abstract":"Recent works have shown that certain classes of emerging memory technologies lend themselves to organizations that offer equally dense access support for patterns with multiple strides, such as row-column memories. However, with few exceptions, these prior works have only considered such multi-orientation memories (MOMs) and MOM-caching techniques in the context of traditional processor architectures. In this work, we explore the potential for leveraging the capabilities of MOMs to present multiple concurrent views of data organization within the memory hierarchy as a means to offload and overlap inter-kernel marshalling, a range of data layout transformations, and even lazy construction of derivative data structures to work performed by the MOM-capable memories and caches themselves. We demonstrate the potential of MOM-offloading to improve performance and reduce data movement for select computation patterns and describe the application of the approach to broader classes of processing in memory workloads.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131150640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ferroelectric FETs (FeFETs) have emerged as a promising multi-level/cell (MLC) nonvolatile memory (NVM) candidate for low-power applications. This originates from the advantages of both efficient memory access and intrinsic device-level in-memory computing flexibilities. However, there still exist challenges for FeFET MLC NVM: (i) high power consumption in read operations due to high-gain requirement for sense amplifiers during sensing, and (ii) high latency and energy consumption in write operations with conventional recursive program-and-verify. Targeting at lower power, less latency, and higher density, this work investigates and optimizes the read and write approaches to MLC FeFET NVM design: (i) Adaptive FeFET memory State Mapping (ASM) between the FeFET drain-source current and the digital states to increase the sensing margin; (ii) Adaptive FeFET Gate Biasing (AGB) read methods that adopt the optimized FeFET gate voltage to boost the sensible dynamic range and to store more levels of states per cell; (iii) Adaptive Prediction-based Direct (APD) write methods that minimize the program-andverify activities. Evaluations show significant latency and energy improvement. Furthermore, the number of sensible levels of states per cell is also increased with an enhanced dynamic sensing range and an enhanced sensing margin.
{"title":"Adaptive Circuit Approaches to Low-Power Multi-Level/Cell FeFET Memory","authors":"Juejian Wu, Yixin Xu, Bowen Xue, Yu Wang, Yongpan Liu, Huazhong Yang, Xueqing Li","doi":"10.1109/ASP-DAC47756.2020.9045106","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045106","url":null,"abstract":"Ferroelectric FETs (FeFETs) have emerged as a promising multi-level/cell (MLC) nonvolatile memory (NVM) candidate for low-power applications. This originates from the advantages of both efficient memory access and intrinsic device-level in-memory computing flexibilities. However, there still exist challenges for FeFET MLC NVM: (i) high power consumption in read operations due to high-gain requirement for sense amplifiers during sensing, and (ii) high latency and energy consumption in write operations with conventional recursive program-and-verify. Targeting at lower power, less latency, and higher density, this work investigates and optimizes the read and write approaches to MLC FeFET NVM design: (i) Adaptive FeFET memory State Mapping (ASM) between the FeFET drain-source current and the digital states to increase the sensing margin; (ii) Adaptive FeFET Gate Biasing (AGB) read methods that adopt the optimized FeFET gate voltage to boost the sensible dynamic range and to store more levels of states per cell; (iii) Adaptive Prediction-based Direct (APD) write methods that minimize the program-andverify activities. Evaluations show significant latency and energy improvement. Furthermore, the number of sensible levels of states per cell is also increased with an enhanced dynamic sensing range and an enhanced sensing margin.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130793011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-01-01DOI: 10.1109/ASP-DAC47756.2020.9045220
Shikhar Tuli, M. Rios, A. Levisse, David Atienza Alonso
The growing need for connected, smart and energy efficient devices requires them to provide both ultra-low standby power and relatively high computing capabilities when awoken. In this context, emerging resistive memory technologies (RRAM) appear as a promising solution as they enable cheap fine grain technology co-integration with CMOS, fast switching and non-volatile storage. However, RRAM technologies suffer from fundamental flaws such as a strong device-to-device and cycle-to-cycle variability which is worsened by aging, forcing the designers to consider worst case design conditions. In this work, we propose, for the first time, a circuit that can take advantage of recently published Write Termination (WT) circuits from both the energy and performances point of view. The proposed RRAM Variability Aware Controller (RRAM-VAC) stores and then coalesces the write requests from the processor before triggering the actual write process. By doing so, it averages the RRAM variability and enables the system to run at the memory programming time distribution mean rather than the worst case tail. We explore the design space of the proposed solution for various RRAM variability specifications, benchmark the effect of the proposed memory controller with real application memory traces and show (for the considered RRAM technology specifications) 44 % to 50 % performances improvement and from 10% to 85% energy gains depending on the application memory access patterns.
{"title":"RRAM-VAC: A Variability-Aware Controller for RRAM-based Memory Architectures","authors":"Shikhar Tuli, M. Rios, A. Levisse, David Atienza Alonso","doi":"10.1109/ASP-DAC47756.2020.9045220","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045220","url":null,"abstract":"The growing need for connected, smart and energy efficient devices requires them to provide both ultra-low standby power and relatively high computing capabilities when awoken. In this context, emerging resistive memory technologies (RRAM) appear as a promising solution as they enable cheap fine grain technology co-integration with CMOS, fast switching and non-volatile storage. However, RRAM technologies suffer from fundamental flaws such as a strong device-to-device and cycle-to-cycle variability which is worsened by aging, forcing the designers to consider worst case design conditions. In this work, we propose, for the first time, a circuit that can take advantage of recently published Write Termination (WT) circuits from both the energy and performances point of view. The proposed RRAM Variability Aware Controller (RRAM-VAC) stores and then coalesces the write requests from the processor before triggering the actual write process. By doing so, it averages the RRAM variability and enables the system to run at the memory programming time distribution mean rather than the worst case tail. We explore the design space of the proposed solution for various RRAM variability specifications, benchmark the effect of the proposed memory controller with real application memory traces and show (for the considered RRAM technology specifications) 44 % to 50 % performances improvement and from 10% to 85% energy gains depending on the application memory access patterns.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133096612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-01-01DOI: 10.1109/ASP-DAC47756.2020.9045616
Patrick Sittel, John Wickerson, M. Kumm, P. Zipf
In modulo scheduling, the number of clock cycles between successive inputs (the initiation interval, II) is traditionally an integer, but in this paper, we explore the benefits of allowing it to be a rational number. This rational II can be interpreted as the average number of clock cycles between successive inputs. As the minimum rational II can be less than the minimum integer II, this translates to higher throughput. We formulate rational-II modulo scheduling as an integer linear programming (ILP) problem that is able to find latency-optimal schedules for a fixed rational II. We have applied our scheduler to a standard benchmark of hardware designs, and our results demonstrate a significant speedup compared to state-of-the-art integer-II and rational-II formulations.
{"title":"Modulo Scheduling with Rational Initiation Intervals in Custom Hardware Design","authors":"Patrick Sittel, John Wickerson, M. Kumm, P. Zipf","doi":"10.1109/ASP-DAC47756.2020.9045616","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045616","url":null,"abstract":"In modulo scheduling, the number of clock cycles between successive inputs (the initiation interval, II) is traditionally an integer, but in this paper, we explore the benefits of allowing it to be a rational number. This rational II can be interpreted as the average number of clock cycles between successive inputs. As the minimum rational II can be less than the minimum integer II, this translates to higher throughput. We formulate rational-II modulo scheduling as an integer linear programming (ILP) problem that is able to find latency-optimal schedules for a fixed rational II. We have applied our scheduler to a standard benchmark of hardware designs, and our results demonstrate a significant speedup compared to state-of-the-art integer-II and rational-II formulations.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123535756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enabled by recent development in silicon photonics, wavelength-routed optical networks-on-chips (WRONoCs) emerge as an appealing next-generation architecture for the communication in multiprocessor system-on-chip. WRONoCs apply a passive routing mechanism that statically reserves all data transmission paths at design time, and are thus able to avoid the latency and energy overhead for arbitration, compared to other ONoC architectures. Current research mostly assumes that in a WRONoC topology, each initiator node sends one bit at a time to a target node. However, the communication parallelism can be increased by assigning multiple wavelengths to each path, which requires a systematic analysis of the physical parameters of the silicon microring resonators and the wavelength usage among different paths. This work proposes a mathematical modeling method to maximize the communication parallelism of a given WRONoC topology, which provides a foundation for exploiting the bandwidth potential of WRONoCs. Experimental results show that the proposed method significantly outperforms the state-of-the-art approach, and is especially suitable for application-specific WRONoC topologies.
{"title":"Maximizing the Communication Parallelism for Wavelength-Routed Optical Networks-On-Chips","authors":"Mengchu Li, Tsun-Ming Tseng, Mahdi Tala, Ulf Schlichtmann","doi":"10.1109/ASP-DAC47756.2020.9045163","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045163","url":null,"abstract":"Enabled by recent development in silicon photonics, wavelength-routed optical networks-on-chips (WRONoCs) emerge as an appealing next-generation architecture for the communication in multiprocessor system-on-chip. WRONoCs apply a passive routing mechanism that statically reserves all data transmission paths at design time, and are thus able to avoid the latency and energy overhead for arbitration, compared to other ONoC architectures. Current research mostly assumes that in a WRONoC topology, each initiator node sends one bit at a time to a target node. However, the communication parallelism can be increased by assigning multiple wavelengths to each path, which requires a systematic analysis of the physical parameters of the silicon microring resonators and the wavelength usage among different paths. This work proposes a mathematical modeling method to maximize the communication parallelism of a given WRONoC topology, which provides a foundation for exploiting the bandwidth potential of WRONoCs. Experimental results show that the proposed method significantly outperforms the state-of-the-art approach, and is especially suitable for application-specific WRONoC topologies.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124592529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-01-01DOI: 10.1109/ASP-DAC47756.2020.9045132
Uzair Sharif, Daniel Mueller-Gritschneder, Ulf Schlichtmann
It has long been acknowledged that some applications feature inherent resilience against soft errors, e.g., the impact of soft errors on multimedia applications is often non-visible to humans. In this paper we investigate the inherent resilience of two typical embedded applications using a case study of a control system and a robot arm. Both studies were enabled by our mixed-mode fault injection simulator ETISS-ML, which allows RTL-accurate fault injection while being able to simulate very long scenarios, e.g. robot movements of several seconds. Our results indicate that full simulation of the embedded system and its environment are required to classify whether the system can tolerate the impact of a soft error. This is due to the fact that it is hard to predict the impact of a certain output deviation without investigating the change in the system behavior taking into account the control loop. Based on this classification method we hope to be able to exploit this resilience for lowering the cost of error detection mechanisms in future research.
{"title":"Investigating the Inherent Soft Error Resilience of Embedded Applications by Full-System Simulation","authors":"Uzair Sharif, Daniel Mueller-Gritschneder, Ulf Schlichtmann","doi":"10.1109/ASP-DAC47756.2020.9045132","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045132","url":null,"abstract":"It has long been acknowledged that some applications feature inherent resilience against soft errors, e.g., the impact of soft errors on multimedia applications is often non-visible to humans. In this paper we investigate the inherent resilience of two typical embedded applications using a case study of a control system and a robot arm. Both studies were enabled by our mixed-mode fault injection simulator ETISS-ML, which allows RTL-accurate fault injection while being able to simulate very long scenarios, e.g. robot movements of several seconds. Our results indicate that full simulation of the embedded system and its environment are required to classify whether the system can tolerate the impact of a soft error. This is due to the fact that it is hard to predict the impact of a certain output deviation without investigating the change in the system behavior taking into account the control loop. Based on this classification method we hope to be able to exploit this resilience for lowering the cost of error detection mechanisms in future research.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"250 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115502863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-01-01DOI: 10.1109/ASP-DAC47756.2020.9045745
Wenjian He, Wei Zhang, Sharad Sinha, Sanjeev Das
Hardware accelerators such as integrated graphics processing units (iGPUs) are increasingly prevalent in modern systems. They typically provide multiplexing support where several user applications can share the iGPU acceleration resources. However, security in this setting has not received sufficient consideration. In this work, we disclose a critical information leakage vulnerability due to defective GPU context management. In essence, residual register values and shared local memory in the iGPU are not cleared during a context switch. As a result, adversaries can recover the secret key of a cryptographic algorithm running on an iGPU from a single snapshot of the leaking channel. User privacy is also under threat due to browser activity eavesdropping through website-fingerprinting attack with high accuracy and resolution. Moreover, this vulnerability can constitute a covert channel with a bandwidth of up to 8 Gbps.
{"title":"iGPU Leak: An Information Leakage Vulnerability on Intel Integrated GPU","authors":"Wenjian He, Wei Zhang, Sharad Sinha, Sanjeev Das","doi":"10.1109/ASP-DAC47756.2020.9045745","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045745","url":null,"abstract":"Hardware accelerators such as integrated graphics processing units (iGPUs) are increasingly prevalent in modern systems. They typically provide multiplexing support where several user applications can share the iGPU acceleration resources. However, security in this setting has not received sufficient consideration. In this work, we disclose a critical information leakage vulnerability due to defective GPU context management. In essence, residual register values and shared local memory in the iGPU are not cleared during a context switch. As a result, adversaries can recover the secret key of a cryptographic algorithm running on an iGPU from a single snapshot of the leaking channel. User privacy is also under threat due to browser activity eavesdropping through website-fingerprinting attack with high accuracy and resolution. Moreover, this vulnerability can constitute a covert channel with a bandwidth of up to 8 Gbps.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126833958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-01-01DOI: 10.1109/ASP-DAC47756.2020.9045734
M. Kabir, Yarui Peng
Chiplet integration using 2.5D packaging is gaining popularity nowadays which enables several interesting features like heterogeneous integration and drop-in design method. In the traditional die-by-die approach of designing a 2.5D system, each chiplet is designed independently without any knowledge of the package RDLs. In this paper, we propose a Chip-Package Co-Design flow for implementing 2.5D systems using existing commercial chip design tools. Our flow encompasses 2.5D-aware partitioning suitable for SoC design, Chip-Package Floorplanning, and post-design analysis and verification of the entire 2.5D system. We also designed our own package planners to route RDL layers on top of chiplet layers. We use an ARM Cortex-M0 SoC system to illustrate our flow and compare analysis results with a monolithic 2D implementation of the same system. We also compare two different 2.5D implementations of the same SoC system following the drop-in approach. Alongside the traditional die-by-die approach, our holistic flow enables design efficiency and flexibility with accurate cross-boundary parasitic extraction and design verification.
{"title":"Chiplet-Package Co-Design For 2.5D Systems Using Standard ASIC CAD Tools","authors":"M. Kabir, Yarui Peng","doi":"10.1109/ASP-DAC47756.2020.9045734","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045734","url":null,"abstract":"Chiplet integration using 2.5D packaging is gaining popularity nowadays which enables several interesting features like heterogeneous integration and drop-in design method. In the traditional die-by-die approach of designing a 2.5D system, each chiplet is designed independently without any knowledge of the package RDLs. In this paper, we propose a Chip-Package Co-Design flow for implementing 2.5D systems using existing commercial chip design tools. Our flow encompasses 2.5D-aware partitioning suitable for SoC design, Chip-Package Floorplanning, and post-design analysis and verification of the entire 2.5D system. We also designed our own package planners to route RDL layers on top of chiplet layers. We use an ARM Cortex-M0 SoC system to illustrate our flow and compare analysis results with a monolithic 2D implementation of the same system. We also compare two different 2.5D implementations of the same SoC system following the drop-in approach. Alongside the traditional die-by-die approach, our holistic flow enables design efficiency and flexibility with accurate cross-boundary parasitic extraction and design verification.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127395189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}