rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Independent Tasks

Ali Mohammed, Aurélien Cavelan, F. Ciorba
Published in: 2019 International Conference on High Performance Computing & Simulation (HPCS), 2019-07-01
DOI: https://doi.org/10.1109/HPCS48598.2019.9188153
Citations: 2

Abstract

Parallel scientific applications that execute on high performance computing (HPC) systems often contain large and computationally intensive parallel loops. The independent loop iterations of such applications represent independent tasks. Dynamic load balancing (DLB) is used to achieve a balanced execution of such applications. However, most of the self-scheduling-based techniques that are typically used to achieve DLB are not robust against component (e.g., processor, network) failures or perturbations that arise on large HPC systems. The self-scheduling-based techniques that do tolerate failures and/or perturbations rely on fault- and/or perturbation-detection mechanisms to trigger the rescheduling of tasks that were scheduled onto failed and/or perturbed components. This work proposes a novel robust dynamic load balancing (rDLB) approach for the robust self-scheduling of scientific applications with independent tasks on HPC systems under failures and/or perturbations. rDLB proactively reschedules already allocated tasks and requires no detection of failures or perturbations. Moreover, rDLB is integrated into an MPI-based DLB library. An analytical model of rDLB shows that, for a fixed problem size, the fault-tolerance overhead decreases linearly with the number of processors. The experimental evaluation shows that applications using rDLB tolerate up to P-1 worker processor failures (where P is the number of processors allocated to the application) and that their performance in the presence of perturbations improves by a factor of 7 compared to the case without rDLB. Moreover, the robustness of applications against perturbations (i.e., flexibility) is boosted by a factor of 30 using rDLB compared to the case without rDLB.
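
The rescheduling idea in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation and not the API of their MPI-based library. The function name `simulate`, its parameters, the one-task-per-request chunking, the round-robin choice among unfinished tasks, and the failure model (a failed worker silently dies right after receiving its first task) are all hypothetical choices made for illustration. The only behavior taken from the abstract is that, once every task has been handed out, idle workers are given already allocated but unfinished tasks, with no failure detection involved.

```python
import heapq


def simulate(num_tasks, num_workers, failed_workers=frozenset(),
             task_time=1.0, robust=True):
    """Return the time at which all tasks are finished, or None if never.

    Assumptions (illustrative only): one task per work request, every task
    takes task_time, and a failed worker dies silently right after it
    receives its first task, so that task's result is never reported.
    """
    finished = set()        # ids of tasks whose results reached the master
    next_unscheduled = 0    # tasks below this id have been handed out once
    resched_cursor = 0      # round-robin position for rescheduling
    pending = {}            # live worker -> task it is currently executing
    # Each event (time, worker) is a work request arriving at the master.
    events = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(events)

    while events:
        now, worker = heapq.heappop(events)
        # A returning worker first reports the task it just completed.
        done = pending.pop(worker, None)
        if done is not None:
            finished.add(done)
            if len(finished) == num_tasks:
                return now
        if next_unscheduled < num_tasks:
            # Ordinary self-scheduling: hand out the next fresh task.
            task = next_unscheduled
            next_unscheduled += 1
        elif robust:
            # Proactive rescheduling: reissue a task that was already
            # allocated but is not finished. The master never checks
            # whether the original owner failed or is merely slow.
            unfinished = [t for t in range(num_tasks) if t not in finished]
            task = unfinished[resched_cursor % len(unfinished)]
            resched_cursor += 1
        else:
            continue        # plain self-scheduling: the worker goes idle
        if worker in failed_workers:
            continue        # the task is lost together with the worker
        pending[worker] = task
        heapq.heappush(events, (now + task_time, worker))
    return None             # some tasks were lost and never re-executed
```

Under these assumptions, `simulate(8, 4)` completes at time 2.0. With three of the four workers failing, `simulate(8, 4, failed_workers={1, 2, 3})` still completes (at time 8.0, because the single survivor re-executes the three lost tasks), whereas the same run with `robust=False` returns `None`: the lost tasks are never reissued. This mirrors, in a toy setting, the claim that up to P-1 worker failures can be tolerated.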