Portable Performance for Visualization Algorithms
Power constraints are forcing supercomputing architects to shift their focus from FLOPs to FLOPs-per-watt. In response, these architects are choosing nodes consisting of many cores per chip and wide vector units, since massive numbers of cores operating at relatively low clock speeds offer the best combination of performance per dollar and per watt. However, there are many hardware architectures to choose from, both those available right now and possibilities for the future. The top machines in the world today are built on technologies like programmable graphics processors (GPUs, e.g., NVIDIA Tesla), many-core co-processors (e.g., Intel Xeon Phi), and large multi-core CPUs (e.g., IBM Power, Intel Xeon). Further, future supercomputing designs may include low-power architectures (e.g., ARM), hybrid designs (e.g., AMD APU), or experimental designs (e.g., FPGA systems).
This variety in hardware architecture is problematic for software developers, who do not want to implement and maintain a distinct solution for each architecture. The issue is particularly acute for visualization software, for two reasons. First, visualization software often requires large code bases, with several community standards containing over a million lines of code. Second, visualization software employs many different algorithms; as a result, optimizing performance for one platform requires optimizing each of its algorithms, and not just one "key loop" as is often the case for simulation codes.
Ideally, software developers could write a single implementation that would simultaneously be insulated from architectural specifics and also obtain excellent performance across all desired architectures. This goal is one of the main drivers behind the recent push for domain-specific languages (DSLs) in high-performance computing. In the case of visualization software, three significant efforts --- Dax, EAVL, and PISTON --- all realized this goal by building a DSL-like infrastructure on top of data-parallel primitives. The three efforts have now merged into a single one, called VTK-m, with a goal of providing the same functionality as VTK, yet with portable performance across multi-core and many-core architectures.
While data-parallel primitives have shown significant promise to date, the downside of the approach is that our community's existing algorithms cannot simply be "ported" into this new framework. In many cases, the algorithms need to be "re-thought" so that they can be composed entirely of data-parallel operations. While some algorithms map naturally, others are more difficult, since isolating the interdependence of operations --- needed so each core on a many-core node can do its own work without interacting with the other cores --- is not always trivial.
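To make the idea of "composing an algorithm entirely of data-parallel operations" concrete, the sketch below (illustrative only, not VTK-m code) expresses stream compaction --- a building block of many visualization algorithms, such as selecting the cells that an isosurface passes through --- purely in terms of three classic primitives: map, exclusive scan, and scatter. Each primitive has a fixed, independent access pattern, which is what lets every core do its own work without coordinating with the others. The serial Python definitions stand in for parallel implementations.

```python
def parallel_map(f, xs):
    # map: apply f to every element independently (trivially parallel)
    return [f(x) for x in xs]

def exclusive_scan(xs):
    # exclusive prefix sum: out[i] = xs[0] + ... + xs[i-1];
    # parallelizable in O(log n) steps with a tree-based algorithm
    out, total = [], 0
    for x in xs:
        out.append(total)
        total += x
    return out

def scatter(values, flags, offsets, size):
    # scatter: each flagged element writes to its precomputed slot,
    # so no two writers ever collide
    out = [None] * size
    for v, f, o in zip(values, flags, offsets):
        if f:
            out[o] = v
    return out

def compact(values, predicate):
    """Keep only the elements satisfying predicate, built entirely
    from the three primitives above."""
    flags = parallel_map(lambda v: 1 if predicate(v) else 0, values)
    offsets = exclusive_scan(flags)            # output slot per kept element
    size = offsets[-1] + flags[-1] if values else 0
    return scatter(values, flags, offsets, size)
```

For example, `compact([3, 7, 1, 9, 4], lambda v: v > 3)` yields `[7, 9, 4]`. The point of the exercise is that the per-element dependence ("where does my output go?") has been isolated into the scan, which is the only step requiring cross-core communication.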
Results
With our research, we are both evaluating the merits of the data-parallel primitive approach and developing new algorithms in this environment. Highlights include: