Faster cloud Star Joins with reduced disk spill and network communication

Brito, Jaqueline Joice; Mosqueiro, Thiago; Ciferri, Ricardo Rodrigues; de Aguiar Ciferri, Cristina Dutra; Altintas, I; Norman, M; Dongarra, J; Krzhizhanovskaya, VV; Lees, M; Sloot, PMA

Author(s):	Brito, Jaqueline Joice; Mosqueiro, Thiago; Ciferri, Ricardo Rodrigues; de Aguiar Ciferri, Cristina Dutra; Altintas, I; Norman, M; Dongarra, J; Krzhizhanovskaya, VV; Lees, M; Sloot, PMA
Total number of authors: 10
Document type:	Scientific article
Source:	PROCEEDINGS OF THE XI LATIN AND AMERICAN ALGORITHMS, GRAPHS AND OPTIMIZATION SYMPOSIUM; v. 80, 12 pp., 2016-01-01.
Abstract
By combining powerful parallel frameworks with on-demand commodity hardware, cloud computing has made analytics and decision support systems standard for enterprises of all sizes. Given the unprecedented volumes of data these companies accumulate, filtering and retrieving that data are pressing challenges. The data is often organized in star schemas, in which Star Joins are ubiquitous and expensive operations. In particular, excessive disk spill and network communication are severe bottlenecks for all current MapReduce or Spark solutions. Here, we propose two efficient solutions that reduce computation time by at least 60%: the Spark Bloom-Filtered Cascade Join (SBFCJ) and the Spark Broadcast Join (SBJ). In contrast, a direct Spark implementation of a sequence of joins performs poorly, which shows the importance of additional filtering to minimize disk spill and network communication. Finally, while SBJ is twice as fast when memory per executor is large enough, SBFCJ is remarkably resilient in low-memory scenarios. Both algorithms are very competitive solutions for Star Joins in the cloud.
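
The abstract names the two strategies but does not describe how they work, so the following is only an illustrative sketch in plain Python (no Spark) of the general ideas the names suggest; it is not the authors' implementation. A broadcast-style join keeps a filtered hash map of every dimension in memory and probes them in one pass over the fact table, while a Bloom-filter-style join first prunes fact rows with compact filters and then joins the survivors one dimension at a time. All names here (`BloomFilter`, `broadcast_star_join`, `bloom_filtered_cascade_join`, the `pk` field, and the toy tables) are assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def broadcast_star_join(fact_rows, dimensions, predicates):
    """Broadcast-style sketch: one in-memory hash map per filtered dimension."""
    lookups = {
        fk: {row["pk"]: row for row in rows if predicates[fk](row)}
        for fk, rows in dimensions.items()
    }
    result = []
    for fact in fact_rows:
        matches = [lookups[fk].get(fact[fk]) for fk in lookups]
        if all(m is not None for m in matches):
            result.append((fact, matches))
    return result


def bloom_filtered_cascade_join(fact_rows, dimensions, predicates):
    """Bloom-filter-style sketch: prune fact rows first, then join in sequence."""
    filtered_dims, filters = {}, {}
    for fk, rows in dimensions.items():
        kept = [row for row in rows if predicates[fk](row)]
        filtered_dims[fk] = kept
        bloom = BloomFilter()
        for row in kept:
            bloom.add(row["pk"])
        filters[fk] = bloom

    # Step 1: cheap pre-filter; may let a few false positives through.
    survivors = [
        fact for fact in fact_rows
        if all(filters[fk].might_contain(fact[fk]) for fk in filters)
    ]

    # Step 2: exact joins, one dimension at a time, drop any false positives.
    partial = [(fact, []) for fact in survivors]
    for fk, kept in filtered_dims.items():
        index = {row["pk"]: row for row in kept}
        partial = [
            (fact, matched + [index[fact[fk]]])
            for fact, matched in partial
            if fact[fk] in index
        ]
    return partial


# Toy star schema: a fact table with two foreign keys and two dimensions.
dimensions = {
    "date_id": [{"pk": 1, "year": 2015}, {"pk": 2, "year": 2016}],
    "store_id": [{"pk": 10, "region": "south"}, {"pk": 20, "region": "north"}],
}
predicates = {
    "date_id": lambda d: d["year"] == 2016,
    "store_id": lambda s: s["region"] == "south",
}
fact_rows = [
    {"date_id": 1, "store_id": 10, "sales": 5.0},
    {"date_id": 2, "store_id": 10, "sales": 7.5},
    {"date_id": 2, "store_id": 20, "sales": 3.0},
]

sbj_result = broadcast_star_join(fact_rows, dimensions, predicates)
sbfcj_result = bloom_filtered_cascade_join(fact_rows, dimensions, predicates)
```

Both functions return the same rows on this toy input. The sketch only illustrates a trade-off consistent with the one the abstract reports: the broadcast variant needs every filtered dimension to fit in memory, whereas the Bloom-filter variant only needs the much smaller filters during the pruning pass.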

FAPESP grant:	12/13158-9 - Storage and retrieval of data warehouse data in cloud computing environments
Grantee:	Jaqueline Joice Brito
Support type:	Scholarships in Brazil - Doctorate
