CloudFlow: A data-aware programming model for cloud workflow applications on modern HPC systems

Zhang, Fan; Malluhi, Qutaibah M.; Elsayed, Tamer; Khan, Samee U.; Li, Keqin; Zomaya, Albert Y.

Author	Zhang, Fan
Author	Malluhi, Qutaibah M.
Author	Elsayed, Tamer
Author	Khan, Samee U.
Author	Li, Keqin
Author	Zomaya, Albert Y.
Date available	2015-12-29T12:54:37Z
Publication date	2015-10
Publication name	Future Generation Computer Systems
Source	Scopus
Identifier	http://dx.doi.org/10.1016/j.future.2014.10.028
Citation	Zhang F., Malluhi Q.M., Elsayed T., Khan S.U., Li K., Zomaya A.Y., CloudFlow: A data-aware programming model for cloud workflow applications on modern HPC systems, (2015) Future Generation Computer Systems, 51, pp. 98-110.
ISSN	0167-739X
URI	http://hdl.handle.net/10576/4019
Abstract	Traditional High-Performance Computing (HPC) based big-data applications are usually constrained by having to move large amounts of data to compute facilities for real-time processing. Modern HPC systems, represented by High-Throughput Computing (HTC) and Many-Task Computing (MTC) platforms, on the other hand, intend to achieve the long-held dream of moving compute to data instead. This kind of data-aware scheduling, typically represented by Hadoop MapReduce, has been successfully implemented in its Map phase, whereby each Map task is sent to the compute node where the corresponding input data chunk is located. However, Hadoop MapReduce limits itself to a one-map-to-one-reduce framework, which makes it difficult to handle complex logic such as pipelines or workflows. It also lacks built-in support and optimization for input datasets that are shared among multiple applications and/or jobs. Performance can be improved significantly when knowledge of the shared and frequently accessed data is taken into account in scheduling decisions. To enhance workflow management in modern HPC systems, this paper presents CloudFlow, a Hadoop MapReduce based programming model for cloud workflow applications. CloudFlow is built on top of MapReduce and is designed to be not only data aware but also shared-data aware. It identifies the most frequently shared data at both the task level and the job level and replicates it to each compute node for data locality. It also supports multiple user-defined Map and Reduce functions, allowing users to orchestrate the required data-flow logic. We prove the correctness of the whole scheduling framework through theoretical analysis. Furthermore, experimental evaluation shows that the execution runtime speedup exceeds 4X compared to a traditional MapReduce implementation, with a manageable time overhead.
Sponsor	NPRP grant # 09-1116-1-172 from the Qatar National Research Fund (a member of Qatar Foundation). Ministry of Science and Technology of China under National 973 Basic Research Program (Grant No. 2013CB228206), National Natural Science Foundation of China (Grant Nos. 61472200 and 61233016).
Language	en
Publisher	Elsevier
Subject	Concurrency; Data aware; HPC; MapReduce; Programming model
Title	CloudFlow: A data-aware programming model for cloud workflow applications on modern HPC systems
Type	Article
Pages	98-110
Volume	51
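
The shared-data-aware idea summarized in the abstract (find the input data that the most jobs share, then replicate it to every compute node) can be illustrated with a minimal sketch. This is not the authors' algorithm or any Hadoop API: the function names `rank_shared_chunks` and `plan_replication`, the `top_k` parameter, and the job and chunk identifiers are all hypothetical, and only job-level sharing is modeled.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rank_shared_chunks(job_inputs: Dict[str, Iterable[str]]) -> List[Tuple[str, int]]:
    """Count how many distinct jobs read each chunk, most-shared first."""
    counts: Counter = Counter()
    for chunks in job_inputs.values():
        counts.update(set(chunks))  # count each chunk at most once per job
    # Sort by descending job count, then by chunk id for a deterministic order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def plan_replication(job_inputs: Dict[str, Iterable[str]],
                     nodes: Iterable[str],
                     top_k: int) -> Dict[str, List[str]]:
    """Assign the top_k chunks read by more than one job to every node."""
    ranked = rank_shared_chunks(job_inputs)[:top_k]
    hot = [chunk for chunk, n_jobs in ranked if n_jobs > 1]
    return {node: list(hot) for node in nodes}


# Hypothetical workload: three jobs with overlapping input chunks.
jobs = {
    "job_a": ["c1", "c2"],
    "job_b": ["c2", "c3"],
    "job_c": ["c2", "c3", "c4"],
}
plan = plan_replication(jobs, nodes=["n1", "n2"], top_k=2)
```

In this toy workload, chunk `c2` is read by three jobs and `c3` by two, so both would be placed on every node, while `c1` and `c4` are each read by a single job and are left where they are.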


This record appears in the following collections

Network and Information and Data Infrastructure Services [142 items]
