
2007 IEEE International Symposium on Performance Analysis of Systems & Software: Latest Publications

A Comparison of Two Approaches to Parallel Simulation of Multiprocessors
Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363732
A. Over, Bill Clarke, P. Strazdins
The design trend towards CMPs has made the simulation of multiprocessor systems a necessity and has also made multiprocessor systems widely available. While a serial multiprocessor simulation necessarily imposes a linear slowdown, running such a simulation in parallel may help mitigate this effect. In this paper we document our experiences with two different methods of parallelizing Sparc Sulima, a simulator of UltraSPARC IIICu-based multiprocessor systems. In the first approach, a simple interconnect model within the simulator is parallelized non-deterministically using careful locking. In the second, a detailed interconnect model is parallelized while preserving determinism using parallel discrete event simulation (PDES) techniques. While both approaches demonstrate a threefold speedup using 4 threads on workloads from the NAS parallel benchmarks, speedup proved to be constrained by load balancing between simulated processors. A theoretical model is developed to help understand why the observed speedup is less than ideal. An analysis of the related speed-accuracy tradeoff in the first approach with respect to the simulation time quantum is also given; the results show that, for both serial and parallel simulation, a quantum on the order of a few hundred cycles represents a 'sweet spot', but parallel simulation is significantly more accurate for a given quantum size. As with the speedup analysis, these effects are workload dependent.
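The quantum-based synchronization underlying this kind of parallel simulation can be sketched as follows. This is an illustrative toy, not the Sulima implementation: `QuantumSimulator`, the quantum size, and the cycle counts are all invented names, and real simulators exchange cross-core events (interconnect traffic, coherence messages) at the barrier rather than just advancing counters.

```python
import threading

QUANTUM = 100        # simulated cycles between synchronization points
TOTAL_CYCLES = 1000  # simulated cycles per core

class QuantumSimulator:
    """Toy quantum-based parallel simulation of N cores.

    Each worker thread advances its core by QUANTUM simulated cycles,
    then waits at a barrier, so no core ever runs more than one quantum
    ahead of another (hypothetical sketch only)."""

    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.cycles = [0] * n_cores
        self.barrier = threading.Barrier(n_cores)

    def run_core(self, cid):
        while self.cycles[cid] < TOTAL_CYCLES:
            # simulate one quantum of this core's execution
            self.cycles[cid] += QUANTUM
            # cross-core interactions are resolved only at quantum
            # boundaries, which is where the speed-accuracy tradeoff
            # discussed in the abstract comes from
            self.barrier.wait()

    def run(self):
        threads = [threading.Thread(target=self.run_core, args=(i,))
                   for i in range(self.n_cores)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.cycles

sim = QuantumSimulator(4)
print(sim.run())  # every core ends at TOTAL_CYCLES
```

A larger quantum means fewer barriers (faster simulation) but coarser interleaving of cross-core events (lower accuracy), matching the 'sweet spot' result above.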
Citations: 12
Performance Impact of Unaligned Memory Operations in SIMD Extensions for Video Codec Applications
Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363737
M. Alvarez, E. Salamí, Alex Ramírez, M. Valero
Although SIMD extensions are a cost-effective way to exploit the data-level parallelism present in most media applications, we will show that they have a very limited memory architecture with weak support for unaligned memory accesses. In video codecs, and other applications, the overhead of accessing unaligned positions without efficient architectural support carries a big performance penalty and in some cases makes vectorization counter-productive. In this paper we analyze the performance impact of extending the Altivec SIMD ISA with unaligned memory operations. Results show that for several kernels in the H.264/AVC media codec, unaligned access support provides a speedup of up to 3.8 times compared to the plain SIMD version, translating into an average of 1.2 times in the entire application. In addition to providing a significant performance advantage, the use of unaligned memory instructions makes programming SIMD code much easier, both for the manual developer and for the auto-vectorizing compiler.
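The overhead being targeted can be illustrated by the classic software workaround on an aligned-only vector ISA: emulating one unaligned vector load with two aligned loads plus a permute that selects the wanted elements. The sketch below uses Python list slices as stand-ins for vector registers; the function names and the 4-element vector width are illustrative, not Altivec semantics.

```python
VEC = 4  # elements per vector register (illustrative width)

def aligned_load(mem, addr):
    """Load one vector starting at addr rounded down to a VEC boundary,
    mimicking hardware that only supports aligned vector loads."""
    base = (addr // VEC) * VEC
    return mem[base:base + VEC]

def unaligned_load(mem, addr):
    """Emulate an unaligned load the aligned-only way: two aligned
    loads plus a shift/permute picking out the wanted elements.
    A native unaligned load would replace all of this with one access."""
    lo = aligned_load(mem, addr)        # vector containing the start
    hi = aligned_load(mem, addr + VEC)  # next aligned vector
    off = addr % VEC
    return (lo + hi)[off:off + VEC]     # the permute step

mem = list(range(16))
print(unaligned_load(mem, 5))  # [5, 6, 7, 8], at the cost of two loads
```

Doubling the memory traffic and adding a permute per access is exactly the cost that native unaligned support removes, hence the kernel-level speedups reported above.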
Citations: 31
Modeling and Single-Pass Simulation of CMP Cache Capacity and Accessibility
Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363743
Xudong Shi, Feiqi Su, J. Peir, Ye Xia, Zhen Yang
Future chip multiprocessors (CMPs) with a large number of cores face difficult issues in efficiently utilizing on-chip storage space. Tradeoffs between data accessibility and effective on-chip capacity have been studied extensively, but understanding a wide spectrum of design spaces requires costly simulations. In this paper, we first develop an abstract model for understanding the performance impact with respect to the degree of data replication. To overcome the lack of real-time interactions among multiple cores in the abstract model, we propose an efficient single-pass stack simulation method to study the performance of a variety of cache organizations on CMPs. The proposed global stack logically incorporates a shared stack and per-core private stacks to collect shared/private reuse (stack) distances for every memory reference in a single simulation pass. With the collected reuse distances, performance in terms of hits/misses and average memory access times can be calculated for multiple cache organizations. The basic stack simulation results can further derive other CMP cache organizations with various degrees of data replication. We verify both the model and the stack results against individual execution-driven simulations that consider realistic cache parameters and delays, using a set of commercial multithreaded workloads. We also compare the simulation time saved by stack simulation. The results show that stack simulation can accurately model the performance of the various studied cache organizations within 2-9% error margins using only about 8% of the simulation time. The results also show that the effectiveness of various techniques for optimizing CMP on-chip storage is closely related to the working sets of the workloads as well as the total cache sizes.
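The single-pass idea rests on LRU stack (reuse) distances: once the distance of every reference is known, the hit ratio of *any* LRU cache size follows without re-simulating. A minimal single-core sketch (the paper's global shared/private stack machinery is more elaborate):

```python
def reuse_distances(trace):
    """One pass over a reference trace, maintaining an LRU stack and
    recording each reference's depth from the top (its reuse distance).
    Cold misses get distance infinity."""
    stack, dists = [], []
    for addr in trace:
        if addr in stack:
            d = len(stack) - 1 - stack.index(addr)  # depth from top
            stack.remove(addr)
        else:
            d = float('inf')                        # first touch
        stack.append(addr)                          # move to top (MRU)
        dists.append(d)
    return dists

def hit_ratio(dists, cache_lines):
    """An LRU cache of C lines hits exactly the references whose
    reuse distance is below C -- evaluated after the single pass."""
    return sum(1 for d in dists if d < cache_lines) / len(dists)

trace = ['A', 'B', 'C', 'A', 'B', 'A']
d = reuse_distances(trace)
print(d)                # [inf, inf, inf, 2, 2, 1]
print(hit_ratio(d, 2))  # only the final reference hits in a 2-line cache
```

Sweeping `cache_lines` over the same `dists` list is what lets one simulation pass cover many cache configurations.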
Citations: 6
An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures
Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363747
Wangyuan Zhang, Xin Fu, Tao Li, J. Fortes
Semiconductor transient faults (i.e., soft errors) have become an increasingly important threat to microprocessor reliability. Simultaneous multithreaded (SMT) architectures exploit thread-level parallelism to improve overall processor throughput. A great deal of research has investigated the performance and power issues of SMT architectures; nevertheless, the effect of multithreaded execution on a microarchitecture's vulnerability to soft errors remains largely unexplored. To address this issue, we have developed a microarchitecture-level soft error vulnerability analysis framework for SMT architectures. Using a mixed set of SPEC CPU 2000 benchmarks, we quantify the impact of multithreading on a wide range of microarchitecture structures. We examine how the baseline SMT microarchitecture reliability profile varies with workload behavior, the number of threads, and fetch policies. Our experimental results show that overall vulnerability rises in multithreading architectures, while each individual thread shows less vulnerability. Considering both performance and reliability, SMT outperforms superscalar architectures. SMT reliability and its tradeoff with performance vary across different fetch policies. With a detailed analysis of the experimental results, we point out a set of potential opportunities to reduce SMT microarchitecture vulnerability, which can serve as guidance for exploiting thread-aware reliability optimization techniques in the near future. To our knowledge, this paper presents the first effort to characterize microarchitecture vulnerability to soft errors on SMT processors.
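Vulnerability analyses of this kind typically boil down to the Architectural Vulnerability Factor (AVF): the fraction of a structure's bit-cycles that hold state required for architecturally correct execution (ACE state). A minimal accounting sketch, with hypothetical bookkeeping (real frameworks derive the ACE intervals from detailed simulation):

```python
def avf(ace_intervals, n_bits, total_cycles):
    """AVF = ACE bit-cycles / total bit-cycles.

    ace_intervals maps a bit index to the cycle numbers during which
    that bit held ACE state (illustrative bookkeeping only)."""
    ace_bit_cycles = sum(len(cycles) for cycles in ace_intervals.values())
    return ace_bit_cycles / (n_bits * total_cycles)

# A 4-bit structure observed for 100 cycles: bit 0 is ACE for 50
# cycles, bit 1 for all 100, bit 2 for 10, bit 3 never.
ace = {0: range(0, 50), 1: range(0, 100), 2: range(90, 100)}
print(avf(ace, n_bits=4, total_cycles=100))  # (50+100+10)/400 = 0.4
```

Under this framing, the abstract's finding reads naturally: with more threads, shared structures hold live (ACE) state for a larger fraction of cycles, raising overall AVF even as each thread's own exposure shrinks.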
Citations: 16
Benefits of I/O Acceleration Technology (I/OAT) in Clusters
Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363752
K. Vaidyanathan, D. Panda
Packet processing in the TCP/IP stack at multi-gigabit data rates occupies a significant portion of the system overhead. Though there are several techniques to reduce the packet processing overhead on the sender side, the receiver side remains a bottleneck. I/O Acceleration Technology (I/OAT), developed by Intel, is a set of features designed specifically to reduce the receiver-side packet processing overhead. This paper studies the benefits of I/OAT through extensive micro-benchmark evaluations as well as evaluations in two different application domains: (1) a multi-tier data-center environment and (2) a parallel virtual file system (PVFS). Our micro-benchmark evaluations show that I/OAT results in 38% lower overall CPU utilization in comparison with traditional communication. Due to this reduced CPU utilization, I/OAT delivers better performance and increased network bandwidth. Our experimental results with data-centers and file systems reveal that I/OAT can improve the total number of transactions processed by 14% and throughput by 12%, respectively. In addition, I/OAT can sustain a large number of concurrent threads (up to a factor of four compared to non-I/OAT) in data-center environments, thus increasing the scalability of the servers.
Citations: 22
Using Model Trees for Computer Architecture Performance Analysis of Software Applications
Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363742
ElMoustapha Ould-Ahmed-Vall, J. Woodlee, Charles R. Yount, K. Doshi, S. Abraham
The identification of performance issues on specific computer architectures has a variety of important benefits, such as tuning software to improve performance, comparing the performance of various platforms, and assisting in the design of new platforms. To enable this analysis, most modern microprocessors provide access to hardware-based event counters. Unfortunately, features such as out-of-order execution, prefetching, and speculation complicate the interpretation of the raw data. Thus, the traditional approach of assigning a uniform estimated penalty to each event does not accurately identify and quantify performance limiters. This paper presents a novel method employing a statistical regression-modeling approach to better achieve this goal. Specifically, a model-tree approach based on the M5' algorithm is implemented and validated that accounts for event interactions and workload characteristics. Data from a subset of the SPEC CPU2006 suite is used by the algorithm to automatically build a performance-model tree, identifying the unique performance classes (phases) found in the suite and associating with each class a unique, explanatory linear model of performance events. These models can be used to identify performance problems for a given workload and estimate the potential gain from addressing each problem. This information can help orient performance optimization efforts, focusing available time and resources on the techniques most likely to address the performance problems with the highest potential gain. The model tree exhibits high correlation (more than 0.98) and low relative absolute error (less than 8%) between predicted and measured performance, attesting that it is a sound approach for performance analysis of modern superscalar machines.
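A model tree differs from an ordinary regression tree in that its leaves hold linear models rather than constants. The toy below makes one split and fits a least-squares line in each leaf; it is a sketch of the idea only (real M5' handles many attributes, recursion, pruning, and smoothing).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b in one dimension."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def fit_model_tree(xs, ys):
    """One-split model tree in the spirit of M5': choose the split
    minimizing the summed squared error of the linear models fit in
    the two resulting leaves (illustrative sketch only)."""
    best = None
    for s in sorted(set(xs))[1:]:
        lo = [(x, y) for x, y in zip(xs, ys) if x < s]
        hi = [(x, y) for x, y in zip(xs, ys) if x >= s]
        if len(lo) < 2 or len(hi) < 2:
            continue
        models, err = [], 0.0
        for leaf in (lo, hi):
            a, b = fit_line(*zip(*leaf))
            models.append((a, b))
            err += sum((y - (a * x + b)) ** 2 for x, y in leaf)
        if best is None or err < best[0]:
            best = (err, s, models)
    _, split, (m_lo, m_hi) = best
    def predict(x):
        a, b = m_lo if x < split else m_hi
        return a * x + b
    return predict

# piecewise-linear data: y = x below 5, y = 3*x - 10 from 5 up
xs = list(range(10))
ys = [x if x < 5 else 3 * x - 10 for x in xs]
model = fit_model_tree(xs, ys)
print(model(2), model(8))  # recovers both regimes: 2.0 14.0
```

The split plays the role of the "performance classes (phases)" above, and each leaf's linear model is the per-class explanatory model of event counts.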
Citations: 54
Phase-Guided Small-Sample Simulation
Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363739
J. L. Kihm, Samuel D. Strom, D. Connors
Detailed cycle-accurate simulation is a critical component of processor design. However, with the increasing complexity of modern processors and application workloads, full detailed simulation is prohibitively slow and thereby severely limits design space exploration. Sampled simulation techniques eliminate the need for full simulation by simulating in detail a very small but representative subset of a target application's overall execution. Two effective and accurate sampling techniques are phase-based simulation and small-sample simulation. Both of these techniques have been adopted by the architecture design and simulation communities for research. However, both techniques were derived using a single benchmark evaluation suite and promote the same sampling method for all applications. Alternatively, an execution-aware sampling-based simulation technique can adapt to the execution characteristics of the individual application being simulated and achieve the most efficient and accurate simulation acceleration. To evaluate the impact of application characteristics on simulation approaches, we compare several simulation techniques using the SpedOOO benchmark suite. Our results yield key conclusions about combining the strengths of previous simulation techniques into a single approach: phase-guided small-sample simulation (PGSS). PGSS adapts sampling to the characteristics of the application, thereby achieving high sampling accuracy while requiring an order of magnitude less detailed simulation time than previous techniques.
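The core of phase-guided sampling can be sketched in a few lines: group execution intervals by a phase signature, run detailed simulation on one representative per phase, and weight each result by the phase's frequency. The names and the pre-assigned phase labels below are illustrative; in practice phase ids come from clustering basic-block vectors or similar signatures.

```python
def phase_guided_estimate(intervals, detailed_sim):
    """Estimate whole-program CPI from one detailed sample per phase,
    weighted by how often each phase occurs (illustrative sketch)."""
    by_phase = {}
    for iv in intervals:
        by_phase.setdefault(iv['phase'], []).append(iv)
    total = len(intervals)
    cpi = 0.0
    for phase, members in by_phase.items():
        sample_cpi = detailed_sim(members[0])      # one sample per phase
        cpi += sample_cpi * len(members) / total   # weight by frequency
    return cpi

# 4 intervals: phase A occurs 3 times at CPI 1.0, phase B once at 3.0
intervals = [{'phase': 'A'}] * 3 + [{'phase': 'B'}]
fake_sim = lambda iv: 1.0 if iv['phase'] == 'A' else 3.0
print(phase_guided_estimate(intervals, fake_sim))  # 0.75*1.0 + 0.25*3.0 = 1.5
```

Only two of the four intervals are simulated in detail here, which is where the order-of-magnitude reduction in detailed simulation time comes from.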
Citations: 6
Performance Characterization of Decimal Arithmetic in Commercial Java Workloads
Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363736
M. Bhat, John Crawford, R. Morin, K. Shiv
Binary floating-point numbers with finite precision cannot represent all decimal numbers with complete accuracy. This can often lead to errors while performing calculations involving floating point numbers. For this reason, many commercial applications use special decimal representations for performing these calculations, but their use carries performance costs such as bi-directional conversion. The purpose of this study was to understand the total application performance impact of using these decimal representations in commercial workloads, and provide a foundation of data to justify pursuing optimized hardware support for decimal math. In Java, a popular development environment for commercial applications, the BigDecimal class is used for performing accurate decimal computations. BigDecimal provides operations for arithmetic, scale manipulation, rounding, comparison, hashing, and format conversion. We studied the impact of BigDecimal usage on the performance of server-side Java applications by analyzing its usage on two standard enterprise benchmarks, SPECjbb2005 and SPECjAppServer2004, as well as a real-life mission-critical financial workload, Morgan Stanley's Trade Completion. In this paper, we present detailed performance characteristics and we conclude that, relative to total application performance, the overhead of using software decimal implementations is low, and at least from the point of view of these workloads, there is insufficient performance justification to pursue hardware solutions.
{"title":"Performance Characterization of Decimal Arithmetic in Commercial Java Workloads","authors":"M. Bhat, John Crawford, R. Morin, K. Shiv","doi":"10.1109/ISPASS.2007.363736","DOIUrl":"https://doi.org/10.1109/ISPASS.2007.363736","url":null,"abstract":"Binary floating-point numbers with finite precision cannot represent all decimal numbers with complete accuracy. This can often lead to errors while performing calculations involving floating point numbers. For this reason many commercial applications use special decimal representations for performing these calculations, but their use carries performance costs such as bi-directional conversion. The purpose of this study was to understand the total application performance impact of using these decimal representations in commercial workloads, and provide a foundation of data to justify pursuing optimized hardware support for decimal math. In Java, a popular development environment for commercial applications, the BigDecimal class is used for performing accurate decimal computations. BigDecimal provides operations for arithmetic, scale manipulation, rounding, comparison, hashing, and format conversion. We studied the impact of BigDecimal usage on the performance of server-side Java applications by analyzing its usage on two standard enterprise benchmarks, SPECjbb2005 and SPECjAppServer2004 as well as a real-life mission-critical financial workload, Morgan Stanley's Trade Completion. 
In this paper, we present detailed performance characteristics and we conclude that, relative to total application performance, the overhead of using software decimal implementations is low, and at least from the point of view of these workloads, there is insufficient performance justification to pursue hardware solutions","PeriodicalId":439151,"journal":{"name":"2007 IEEE International Symposium on Performance Analysis of Systems & Software","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116888538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
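A minimal Python sketch of the precision problem these workloads guard against; the standard-library `decimal` module stands in here for Java's BigDecimal (an analogy for illustration, not the paper's actual setup):

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 exactly, so small errors appear.
binary_sum = 0.1 + 0.2
print(binary_sum)         # 0.30000000000000004, not 0.3
print(binary_sum == 0.3)  # False

# A software decimal representation is exact, at some conversion/CPU cost.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))  # True
```

BigDecimal makes the same trade: exact decimal arithmetic in exchange for software-implemented operations, which is precisely the overhead the paper measures against total application performance.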
Citations: 6
Characterizing a Complex J2EE Workload: A Comprehensive Analysis and Opportunities for Optimizations
Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363735
Yefim Shuf, I. Steiner
While past studies of relatively simple Java benchmarks like SPECjvm98 and SPECjbb2000 have been integral in advancing the server industry, this paper presents an analysis of a significantly more complex 3-Tier J2EE (Java 2 Enterprise Edition) commercial workload, SPECjAppServer2004. Understanding the nature of such commercial workloads is critical to develop the next generation of servers and identify promising directions for systems and software research. In this study, we validate and disprove several assumptions commonly made about Java workloads. For instance, on a tuned system with an appropriately sized heap, the fraction of CPU time spent on garbage collection for this complex workload is small (<2%) compared to commonly studied client-side Java benchmarks. Unlike small benchmarks, this workload has a rather "flat" method profile with no obvious hot spots. Therefore, new performance analysis techniques and tools to identify opportunities for optimizations are needed because the traditional 90/10 rule of thumb does not apply. We evaluate hardware performance monitor data and use insights to motivate future research. We find that this workload has a relatively high CPI and branch misprediction rate. We observe that almost one half of executed instructions are loads and stores and that the data working set is large. There are very few cache-to-cache "modified data" transfers, which limits opportunities for intelligent thread co-scheduling. We note that while using large pages for a Java heap is a simple and effective way to reduce TLB misses and improve performance, there is room to reduce translation misses further by placing executable code into large pages. We use statistical correlation to quantify the relationship between various hardware events and overall system performance. We find that CPI is strongly correlated with branch mispredictions, translation misses, instruction cache misses, and bursty data cache misses that trigger data prefetching. We note that target address mispredictions for indirect branches (corresponding to Java virtual method calls) are strongly correlated with instruction cache misses. Our observations can be used by hardware and runtime architects to estimate potential benefits of performance enhancements being considered.
{"title":"Characterizing a Complex J2EE Workload: A Comprehensive Analysis and Opportunities for Optimizations","authors":"Yefim Shuf, I. Steiner","doi":"10.1109/ISPASS.2007.363735","DOIUrl":"https://doi.org/10.1109/ISPASS.2007.363735","url":null,"abstract":"While past studies of relatively simple Java benchmarks like SPECjvm98 and SPECjbb2000 have been integral in advancing the server industry, this paper presents an analysis of a significantly more complex 3-Tier J2EE (Java 2 Enterprise Edition) commercial workload, SPECjAppServer2004. Understanding the nature of such commercial workloads is critical to develop the next generation of servers and identify promising directions for systems and software research. In this study, we validate and disprove several assumptions commonly made about Java workloads. For instance, on a tuned system with an appropriately sized heap, the fraction of CPU time spent on garbage collection for this complex workload is small (<2%) compared to commonly studied client-side Java benchmarks. Unlike small benchmarks, this workload has a rather \"flat\" method profile with no obvious hot spots. Therefore, new performance analysis techniques and tools to identify opportunities for optimizations are needed because the traditional 90/10 rule of thumb does not apply. We evaluate hardware performance monitor data and use insights to motivate future research. We find that this workload has a relatively high CPI and a branch misprediction rate. We observe that almost one half of executed instructions are loads and stores and that the data working set is large. There are very few cache-to-cache \"modified data\" transfers which limits opportunities for intelligent thread co-scheduling. We note that while using large pages for a Java heap is a simple and effective way to reduce TLB misses and improve performance, there is room to reduce translation misses further by placing executable code into large pages. 
We use statistical correlation to quantify the relationship between various hardware events and an overall system performance. We find that CPI is strongly correlated with branch mispredictions, translation misses, instruction cache misses, and bursty data cache misses that trigger data prefetching. We note that target address mispredictions for indirect branches (corresponding to Java virtual method calls) are strongly correlated with instruction cache misses. Our observations can be used by hardware and runtime architects to estimate potential benefits of performance enhancements being considered","PeriodicalId":439151,"journal":{"name":"2007 IEEE International Symposium on Performance Analysis of Systems & Software","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130731223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
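The correlation analysis described above can be illustrated with a plain Pearson coefficient; the per-interval samples below are invented for illustration and are not the paper's measurements:

```python
# Pearson correlation between two per-interval measurement series.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

cpi = [1.1, 1.4, 2.0, 1.2, 1.8]      # invented per-interval CPI samples
br_miss = [0.8, 1.1, 1.9, 0.9, 1.6]  # invented branch misses per 1K instructions
print(pearson(cpi, br_miss))  # close to 1.0: the event tracks CPI closely
```

Ranking hardware events by such a coefficient against CPI is one simple way to decide which microarchitectural enhancement is worth pursuing first.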
Citations: 8
Using Wavelet Domain Workload Execution Characteristics to Improve Accuracy, Scalability and Robustness in Program Phase Analysis
Pub Date : 2007-04-25 DOI: 10.1109/ISPASS.2007.363744
Chang-Burm Cho, Tao Li
Program phase analysis has many applications in computer architecture design and optimization. Recently, there has been a growing interest in employing wavelets as a tool for phase analysis. Nevertheless, the examined scope of workload characteristics and the explored benefits due to wavelet-based analysis are quite limited. This work further extends prior research by applying wavelet analysis to abundant types of program execution statistics and quantifying the benefits of wavelet analysis in terms of accuracy, scalability and robustness in phase classification. Experimental results on SPEC CPU 2000 benchmarks show that compared with methods that work in the time domain, wavelet domain phase analysis achieves higher accuracy and exhibits superior scalability and robustness. We examine and contrast the effectiveness of applying wavelets to a wide range of runtime workload execution characteristics. We find that wavelet transform significantly reduces temporal dependence in the sampled workload statistics and therefore simple models which are insufficient in the time domain become quite accurate in the wavelet domain. More attractively, we show that different types of workload execution characteristics in the wavelet domain can be assembled together to further improve phase classification accuracy. For long-running, complex and real-world workloads, a scalable phase analysis technique is essential to capture the manifested large-scale program behavior. In this study, we show that such scalability can be achieved by applying wavelet analysis of high dimension sampled workload statistics to alleviate the counter overflow problem, which can negatively affect phase classification accuracy. By exploiting the wavelet denoising capability, we show in this paper that phase classification can be performed robustly under program execution variability. To our knowledge, this work presents the first effort on using wavelets to improve scalability and robustness in phase analysis.
{"title":"Using Wavelet Domain Workload Execution Characteristics to Improve Accuracy, Scalability and Robustness in Program Phase Analysis","authors":"Chang-Burm Cho, Tao Li","doi":"10.1109/ISPASS.2007.363744","DOIUrl":"https://doi.org/10.1109/ISPASS.2007.363744","url":null,"abstract":"Program phase analysis has many applications in computer architecture design and optimization. Recently, there has been a growing interest in employing wavelets as a tool for phase analysis. Nevertheless, the examined scope of workload characteristics and the explored benefits due to wavelet-based analysis are quite limited. This work further extends prior research by applying wavelets analysis to abundant types of program execution statistics and quantifying the benefits of wavelet analysis in terms of accuracy, scalability and robustness in phase classification. Experimental results on SPEC CPU 2000 benchmarks show that compared with methods that work in the time domain, wavelet domain phase analysis achieves higher accuracy and exhibits superior scalability and robustness. We examine and contrast the effectiveness of applying wavelets to a wide range of runtime workload execution characteristics. We find that wavelet transform significantly reduces temporal dependence in the sampled workload statistics and therefore simple models which are insufficient in the time domain become quite accurate in the wavelet domain. More attractively, we show that different types of workload execution characteristics in wavelet domain can be assembled together to further improve phase classification accuracy. For long-running, complex and real-world workloads, a scalable phase analysis technique is essential to capture the manifested large-scale program behavior. In this study, we show that such scalability can be achieved by applying wavelet analysis of high dimension sampled workload statistics to alleviate the counter overflow problem which can negatively affect phase classification accuracy. 
By exploiting the wavelet denoising capability, we show in this paper that phase classification can be performed robustly under program execution variability. To our knowledge, this work presents the first effort on using wavelets to improve scalability and robustness in phase analysis","PeriodicalId":439151,"journal":{"name":"2007 IEEE International Symposium on Performance Analysis of Systems & Software","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117280753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
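A toy single-level Haar wavelet sketch of the denoising idea: detail coefficients below a threshold are treated as sampling noise and zeroed, while the coarse averages preserve the phase structure. The signal values and threshold below are made up, and a real phase analyzer would use a multi-level transform:

```python
# Single-level Haar transform: pairwise averages (coarse) and differences (detail).
def haar_level(signal):
    avg = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    det = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    return avg, det

# Hard-threshold the detail coefficients, then invert the transform.
def denoise(signal, threshold):
    avg, det = haar_level(signal)
    det = [d if abs(d) > threshold else 0.0 for d in det]
    out = []
    for a, d in zip(avg, det):
        out.extend([a + d, a - d])
    return out

cpi_trace = [1.0, 1.1, 1.0, 1.05, 2.0, 2.1, 2.0, 1.95]  # two phases plus jitter
print(denoise(cpi_trace, 0.1))  # levels near 1.0 and 2.0 survive, jitter is smoothed
```

Classifying phases on the denoised (or coarse) coefficients rather than raw samples is what makes the classification robust to run-to-run execution variability.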
Citations: 11
Journal
2007 IEEE International Symposium on Performance Analysis of Systems & Software