Efficient Distributed-Memory Parallelism

How should work be divided across the nodes of a supercomputer to achieve good load balance and minimize execution time?

Supercomputing simulations generate massive data sets, with each time slice containing billions or even trillions of cells. Such data is so large that it cannot fit into the memory of a single compute node. Instead, the data is decomposed into pieces (sometimes called blocks or domains) and each compute node works on a subset of the pieces at any given time. One processing approach is to parallelize over the pieces. In this case, the pieces are partitioned over the compute nodes. For example, if there were ten pieces, P0 through P9, and two compute nodes, N0 and N1, then N0 could operate on P0 through P4 and N1 could operate on P5 through P9. Another approach is to parallelize over the visualization work that needs to be performed. In this case, compute nodes would fetch the pieces they need to carry out their work. This fetching can be from a disk with post hoc processing or from another compute node with in situ processing. There are also hybrid approaches that use elements of parallelizing both over pieces and over work. For each individual visualization algorithm, the key research question is what approach will lead to the fastest execution time.
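The two approaches above can be illustrated with a minimal Python sketch. It is only a toy model under stated assumptions, not the scheme used by any particular visualization system: the function names `partition_pieces` and `pieces_to_fetch` are hypothetical, pieces are assigned in contiguous ranges, and work items are dealt round-robin. The first function reproduces the ten-piece, two-node example; the second shows which pieces each node would need to fetch when parallelizing over work.

```python
def partition_pieces(num_pieces, num_nodes):
    """Parallelize over pieces: give each node a contiguous range of piece IDs.

    Any remainder is spread over the first nodes, so node loads differ by at most one.
    """
    base, extra = divmod(num_pieces, num_nodes)
    assignment = {}
    start = 0
    for node in range(num_nodes):
        count = base + (1 if node < extra else 0)
        assignment[node] = list(range(start, start + count))
        start += count
    return assignment


def pieces_to_fetch(work_items, num_nodes):
    """Parallelize over work: deal work items round-robin to nodes.

    Each work item is the set of piece IDs it touches (a hypothetical input).
    Returns, per node, the set of pieces that node must fetch.
    """
    needed = {node: set() for node in range(num_nodes)}
    for index, pieces in enumerate(work_items):
        needed[index % num_nodes].update(pieces)
    return needed


# Ten pieces (P0..P9) over two nodes (N0, N1), as in the text.
by_piece = partition_pieces(10, 2)
print(by_piece)

# Four hypothetical work items, each listing the pieces it needs.
work = [{0, 1}, {1, 2}, {7}, {2, 3, 4}]
by_work = pieces_to_fetch(work, 2)
print(by_work)
```

In the second case, both nodes end up needing piece 1. That kind of redundant fetching is one cost that has to be weighed against the better load balance that parallelizing over work can offer.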

Prior to the founding of the CDUX research group, Hank Childs and collaborators published many works on efficiently parallelizing visualization algorithms for supercomputers, including volume rendering, ray tracing, streamlines, and connected components. For many of these works, the unit of parallelization was not obvious, and the contribution was finding a middle ground between two extremes: for volume rendering, between parallelizing over samples and over cells; for streamlines, between parallelizing over particles and over cells; and for ray tracing, between parallelizing over rays and over cells. A separate research arc considered the effects of incorporating hybrid parallelism (i.e., both shared- and distributed-memory parallelism). The most interesting findings were for algorithms that stressed communication: volume rendering with multi-core CPUs and GPUs, and particle advection with multi-core CPUs and GPUs. In many of these cases, and in particular in the TVCG work by Camp et al., the studies found surprising speedups for hybrid parallelism, due to the combined effects of increased efficiency within a node and reduced traffic across nodes. Finally, Childs led a scalability study for visualization software processing trillions of cells on tens of thousands of cores. The lasting impact of this paper has been to demonstrate the extent to which visualization software is I/O-bound, and thus to motivate the push to in situ processing.

CDUX students have continued to innovate in new directions for efficient parallelism. Roba Binyahib did a series of works on efficient parallel particle advection. Her first study extended an existing work stealing approach to use the Lifeline method, earning a Best Paper Honorable Mention at LDAV 2019. She then performed a bakeoff study, comparing four parallelization approaches, with concurrencies of up to 8192 cores, data sets as large as 34 billion cells, and as many as 300 million particles. Next, Roba designed a new meta-algorithm called HyLiPoD, which built on her bakeoff results to adapt its parallelization based on workload. This work was awarded Best Short Paper at EGPGV. Her final work on particle advection considered in situ settings, and challenged the common assumption that simulation data should not be moved. Roba also pursued parallelization outside of particle advection, extending Hank's volume rendering algorithm in a TVCG publication. Her study performed experiments that more conclusively demonstrated the benefit of the algorithm, and it addressed some deficiencies with respect to memory footprint and communication. Other CDUX students have pursued efficient parallelization as well. Sam Schwartz also considered parallel particle advection, specifically how machine learning can be used to optimize settings for frequency of communication and message size. Matt Larsen considered rendering image databases and the communication optimizations that are possible when rendering multiple images simultaneously. Finally, Ryan Bleile and Jordan Weiler improved the state of the art for connected components computations by improving efficiency in communication.

CDUX People

Roba Binyahib
Ph.D. Student (alum)

Matt Larsen
Ph.D. Student (alum)

Ryan Bleile
Ph.D. Student (alum)

Jordan Weiler
M.S. Student (alum)

Hank Childs
CDUX Director

Publications by CDUX Students

Parallel Particle Advection Bake-Off for Scientific Visualization Workloads
Roba Binyahib, David Pugmire, Abhishek Yenpure, and Hank Childs
IEEE Cluster Conference, Kobe, Japan, September 2020

[PDF] [BIB]

A Lifeline-Based Approach for Work Requesting and Parallel Particle Advection
Roba Binyahib, David Pugmire, Boyana Norris, and Hank Childs
IEEE Symposium on Large Data Analysis and Visualization (LDAV), Vancouver, Canada, October 2019
Best Paper Honorable Mention

[PDF] [BIB]

HyLiPoD: Parallel Particle Advection Via a Hybrid of Lifeline Scheduling and Parallelization-Over-Data
Roba Binyahib, David Pugmire, and Hank Childs
Eurographics Symposium on Parallel Graphics and Visualization (EGPGV), Zurich, Switzerland, June 2021
Best Short Paper

[PDF] [BIB]

A Scalable Hybrid Scheme for Ray-Casting of Unstructured Volume Data
Roba Binyahib, Tom Peterka, Matthew Larsen, Kwan-Liu Ma, and Hank Childs
IEEE Transactions on Visualization and Computer Graphics, July 2019

[PDF] [BIB]

Machine Learning-Based Autotuning for Parallel Particle Advection
Samuel D. Schwartz, Hank Childs, and David Pugmire
Eurographics Symposium on Parallel Graphics and Visualization (EGPGV), Zurich, Switzerland, June 2021

[PDF] [BIB]

Optimizing Multi-Image Sort-Last Parallel Rendering
Matthew Larsen, Ken Moreland, Chris Johnson, and Hank Childs
IEEE Symposium on Large Data Analysis and Visualization (LDAV), Baltimore, MD, October 2016

[PDF] [BIB]

A Distributed-Memory Algorithm for Connected Components Labeling of Simulation Data
Cyrus Harrison, Jordan Weiler, Ryan Bleile, Kelly Gaither, and Hank Childs
Topological and Statistical Methods for Complex Data, December 2014

[LINK] [BIB]

In Situ Particle Advection Via Parallelizing Over Particles
Roba Binyahib, David Pugmire, and Hank Childs
ISAV 2019: In Situ Infrastructures for Enabling Extreme-scale Analysis and Visualization, Denver, CO, November 2019

[PDF] [BIB]

Publications Prior to the Founding of CDUX

Exploring the Spectrum of Dynamic Scheduling Algorithms for Scalable Distributed-Memory Ray Tracing
Paul Navratil, Hank Childs, Donald Fussell, and Calvin Lin
IEEE Transactions on Visualization and Computer Graphics (TVCG), June 2014

[PDF] [BIB]

GPU Acceleration of Particle Advection Workloads in a Parallel, Distributed Memory Setting
David Camp, Hari Krishnan, David Pugmire, Christoph Garth, Ian Johnson, Wes Bethel, Kenneth I. Joy, and Hank Childs
Eurographics Symposium on Parallel Graphics and Visualization (EGPGV), Girona, Spain, May 2013

[PDF] [BIB]

Dynamic Scheduling for Large-Scale Distributed-Memory Ray Tracing
Paul Navratil, Donald Fussell, Calvin Lin, and Hank Childs
EGPGV, Cagliari, Italy, May 2012
Best Paper

[PDF] [BIB]

Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems
Mark Howison, Wes Bethel, and Hank Childs
IEEE Transactions on Visualization and Computer Graphics (TVCG), January 2012

[PDF] [BIB]

Streamline Integration Using MPI-Hybrid Parallelism on a Large Multicore Architecture
David Camp, Christoph Garth, Hank Childs, David Pugmire, and Kenneth I. Joy
IEEE Transactions on Visualization and Computer Graphics (TVCG), November 2011

[PDF] [BIB]

Data-Parallel Mesh Connected Components Labeling and Analysis
Cyrus Harrison, Hank Childs, and Kelly Gaither
EGPGV, Llandudno, Wales, April 2011

[PDF] [BIB]

Large Data Visualization on Distributed Memory Multi-GPU Clusters
Thomas Fogal, Hank Childs, Siddharth Shankar, Jens Krueger, R. Daniel Bergeron, and Philip Hatcher
HPG: Proceedings of the Conference on High Performance Graphics, June 2010

[PDF] [BIB]

MPI-hybrid Parallelism for Volume Rendering on Large, Multi-core Systems
Mark Howison, Wes Bethel, and Hank Childs
EGPGV, Norrköping, Sweden, May 2010
Best Paper

[PDF] [BIB]

Extreme Scaling of Production Visualization Software on Diverse Architectures
Hank Childs, David Pugmire, Sean Ahern, Brad Whitlock, Mark Howison, Prabhat, Gunther Weber, and Wes Bethel
IEEE Computer Graphics and Applications (CG&A), May 2010

[PDF] [BIB]

Scalable Computation of Streamlines on Very Large Datasets
David Pugmire, Hank Childs, Christoph Garth, Sean Ahern, and Gunther Weber
ACM/IEEE Conference on High Performance Computing (SC09), Portland, OR, November 2009

[PDF] [BIB]

A Scalable, Hybrid Scheme for Volume Rendering Massive Data Sets
Hank Childs, Mark Duchaineau, and Kwan-Liu Ma
EGPGV, Braga, Portugal, May 2006

[PDF] [BIB]