Cluster Programming using the OpenMP Accelerator Model

Yviquel, Herve; Cruz, Lauro; Araujo, Guido

Author(s):	Yviquel, Herve ^[1]; Cruz, Lauro ^[1]; Araujo, Guido ^[1] Total number of authors: 3
Affiliation(s):	^[1] Univ Estadual Campinas, UNICAMP, Inst Comp, Av Albert Einstein 1251, Cidade Univ, Campinas, SP - Brazil Total number of affiliations: 1
Document type:	Scientific article
Source:	ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION; v. 15, n. 3 OCT 2018.
Web of Science citations:	1
Abstract
Computation offloading is a programming model in which program fragments (e.g., hot loops) are annotated so that their execution is performed in dedicated hardware or accelerator devices. Although offloading has been extensively used to move computation to GPUs through directive-based annotation standards like OpenMP, offloading computation to very large computer clusters can become a complex and cumbersome task. It typically requires mixing programming models (e.g., OpenMP and MPI) and languages (e.g., C/C++ and Scala), dealing with various access control mechanisms from different cloud providers (e.g., AWS and Azure), and integrating all this into a single application. This article introduces computer cluster nodes as simple OpenMP offloading devices that can be used either from a local computer or from the cluster head-node. It proposes a methodology that transforms OpenMP directives into Spark runtime calls with fully integrated communication management, so that a cluster appears to the programmer as yet another accelerator device. Experiments using LLVM 3.8 and OpenMP 4.5 on well-known cloud infrastructures (Microsoft Azure and Amazon EC2) show the viability of the proposed approach, enable a thorough analysis of its performance, and allow a comparison with an MPI implementation. The results show that although data transfers can impose overheads, cloud offloading from a local machine can still achieve promising speedups for larger granularity: up to 115x on 256 cores for the 2MM benchmark using 1GB sparse matrices. In addition, the parallel implementation of a complex and relevant scientific application reveals an 80x speedup on a 320-core machine when executed directly from the head-node of the cluster. (AU)
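
To illustrate the kind of annotation the abstract refers to, the following is a minimal sketch of a hot loop offloaded with standard OpenMP 4.5 accelerator directives. The function name `matmul_offload`, the single dense matrix product, and the row-major layout are assumptions made for illustration only; they are not taken from the article, whose 2MM benchmark chains two products over sparse matrices. The abstract does not say how the cluster is selected as the target device or how the directives are lowered to Spark calls, so the sketch shows only the programmer-facing side.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the article): C = A * B for n x n
 * row-major matrices, with the outer loop marked for offloading. */
void matmul_offload(const float *A, const float *B, float *C, int n)
{
    assert(n > 0);

    /* map(to:) sends the inputs to the device and map(from:) brings the
     * result back. If no device is available, or OpenMP is disabled at
     * compile time, the loop simply runs on the host. */
    #pragma omp target map(to: A[0:n*n], B[0:n*n]) map(from: C[0:n*n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}
```

Under the approach described in the abstract, a region annotated this way would be handed to cluster nodes rather than to a GPU, with the data movement expressed by the `map` clauses handled by the runtime.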

FAPESP grant:	14/25694-8 - Loop parallelization using map-reduce in the cloud for scientific workloads
Grantee:	Hervé Yviquel
Support type:	Scholarships in Brazil - Post-Doctoral


FAPESP grant:	17/21339-7 - Loop and task parallelization using map-reduce on heterogeneous clusters in the cloud for scientific workloads
Grantee:	Hervé Yviquel
Support type:	Scholarships abroad - Research Internship - Post-Doctoral
