Paper Title
Translation of Array-Based Loops to Distributed Data-Parallel Programs
Paper Authors
Paper Abstract
Large volumes of data generated by scientific experiments and simulations come in the form of arrays, and the programs that analyze these data are frequently expressed as array operations in an imperative, loop-based language. However, as datasets grow larger, new frameworks for distributed Big Data analytics have become essential tools for large-scale scientific computing. Scientists, who are typically comfortable with numerical analysis tools but are not familiar with the intricacies of Big Data analytics, must now learn to convert their loop-based programs to distributed data-parallel programs. We present a novel framework for translating programs expressed as array-based loops to distributed data-parallel programs that is more general and efficient than related work. Although our translations are over sparse arrays, we extend our framework to handle packed arrays, such as tiled matrices, without sacrificing performance. We report on a prototype implementation on top of Spark and evaluate the performance of our system relative to hand-written programs.
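To make the kind of conversion described in the abstract concrete, the following is a minimal sketch in plain Python, not the paper's translation scheme and not Spark code. It assumes a matrix stored sparsely as a collection of ((row, column), value) pairs, and shows how a triple-nested multiplication loop can be restated as a join on the shared index followed by a per-key sum. All function names here (matmul_loops, to_sparse, matmul_data_parallel) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def matmul_loops(A, B, n, m, p):
    """Imperative, loop-based multiplication of an n x m by an m x p matrix."""
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def to_sparse(M):
    """Store a matrix as ((row, col), value) pairs, omitting zero entries."""
    return [((i, j), v)
            for i, row in enumerate(M)
            for j, v in enumerate(row)
            if v != 0]


def matmul_data_parallel(sa, sb):
    """Same result expressed as a join on k followed by a sum per (i, j) key."""
    # Index the entries of B by their row index k (the join key).
    b_by_row = defaultdict(list)
    for (k, j), v in sb:
        b_by_row[k].append((j, v))

    # Join step: pair every A[i][k] with every B[k][j] sharing the same k.
    products = [((i, j), a * b)
                for (i, k), a in sa
                for (j, b) in b_by_row.get(k, [])]

    # Reduce-by-key step: sum the partial products for each output cell.
    out = defaultdict(float)
    for key, v in products:
        out[key] += v
    return dict(out)
```

In a distributed setting, the join and the per-key sum are the steps that a data-parallel engine could spread across workers; this single-process version only illustrates the shape of the restated program.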