This research project is to solve large-scale linear systems in seismic tomography. This is a collaborative project of Department of Computer Science and Department of Geology and Geophysics in University of Wyoming, and National Center for Atmospheric Research (NCAR). Structural seismology is the study of the internal structure of the Earth using seismic waves obtained on the surface. It plays an important role in a wide range of scientific as well as practical applications. Today, advances in computational technology, seismic instrumentation, and numerical methods have enabled large-scale seismic surveys and full-waveform tomographic inversions that can provide images of the Earth's internal structure at unprecedented resolution. A key component in today's seismic tomographic inversions is the solution of large-scale linear systems.
Least Square with QR factorization (LSQR) is a popular Krylov subspace method widely used in structural seismology. It has been shown faster than other algorithms in tomographic experiments. However, applying LSQR to large-scale seismic tomography problem could be computationally very challenging mainly because it is simultaneously compute-, memory- and communication-intensive. The coefficient matrix in the linear system is typically very large and sparse. E.g., a modest-sized dataset of Los Angles Basin could have 261 million rows and 38 million columns with 5 billion nonzero values. In addition, the nonzero distribution is very uneven. Some parts of the matrix is random and nearly dense, while some parts are highly structured and extremely sparse. Advances in structural seismology will likely increase the order of the design matrix by a factor of three. There are several previous research on parallelizing LSQR algorithm in seismic community and also a parallel solver in PETSc library, a well-known mathematical library for parallel solution of scientific applications. However, these approaches suffer from communication problem when performing large-scale test. Therefore, an algorithm that scales with both problem size and core count is necessary.
Our research project of LSQR can be divided into two phases. In phase-I (September 2009 - July 2011), we build an initial and reliable parallel implementation for LSQR, and also utilize emerging accelerators like GPUs to speed up calculation. The purpose of phase-I is to identify the system bottleneck and estimate the potential of using GPUs in order to assist designing phase-II. In phase-II (June 2010 - summer 2013), we focus on reducing communication cost by fundamentally change the algorithm. The objective of this phase is to make the new algorithm scalable at large-scale. I am heavily involved in this project as the co-designer of algorithms and the primary developer.
In phase-I, we implement PLSQR with MPI and CUDA programming models. MPI is used for partitioning the big task into smaller sub-tasks and coordination between these sub-tasks. CUDA is used to drive GPUs to speed up calculation at individual sub-task. At MPI level, our contributions include: (1) decompose both matrix and vector to increase parallelism; (2) design a static load balancing strategy. At CUDA level, our contributions include: (1) utilize CUDA accelerated CUBLAS library and CUSPARSE library to compute major steps in LSQR; (2) optimize memory copy between host memory and device memory; (3) develop CUDA kernels to perform transpose SpMV without actually transposing the matrix in memory or preserving additional copy. PLSQR demonstrates good quality in mid-scale test at Keeneland, a CPU-GPU cluster at National Institute for Computational Sciences (NICS). PLSQR scales both in strong scaling test (1 to 60 GPUs) and weak scaling test (15 to 135 GPUs). The advantage of using GPUs is obvious. GPU implementation is on average 2x to 4x faster than corresponding CPU implementation. This work is published at 2012 International Conference of Computation Science.
But when applying PLSQR to large-scale test, communication become the dominating part instead of calculation. Communication heavily limits the scalability at large-scale. We realize this cannot be improved by merely improving implementation quality. So our group launch phase-II (SPLSQR) of the project to seek a fundamental change of the algorithm. SPLSQR addresses the computational challenges by utilizing special partitioning strategy and a computational algorithm that is based on the characteristics of the seismic datasets. In particular, SPLSQR contains a novel data decomposition strategy that treats different components of the matrix separately. SPLSQR results in an algorithm with scalable communication volume between a fixed and modest number of communication neighbors. Note that, this optimization technique is not just limited to seismic dataset. In fact, dataset of tomographic problem in other domains also has similar structures. Thus, SPLSQR can be expanded to other domains.
A preliminary implementation for SPLSQR using MPI was developed in early 2012. We tested it on Kraken at NICS. The new algorithm demonstrates good scalability from 360 to 19,600 CPU cores. Compared with PLSQR and implementation using PETSc library, SPLSQR reduces the communication time by a factor of 20.
There are still ample opportunities to further improve the scalability of SPLSQR. The communication still contains some redundant data. By further refining the data structure and transmitting only the necessary data, we expect to reduce the communication volume by a factor of two or three. Like phase-I, we will port the computational part of SPLSQR to GPUs. By previous experience, considerable speedup on calculation can be expected. In addition, two more powerful new supercomputers will be available for us to use: Blue Waters machine with GPU acceleration from National Center for Supercomputing Applications (NCSA) and Yellowstone machine from NCAR-Wyoming Supercomputer Center (NWSC). The further refined algorithm and these more powerful computing resources will enable us to conduct seismic experiment on hundreds of thousands of CPU cores and GPUs.
Dr. Liqiang Wang, Computer Science, University of Wyoming, University of Central Florida
Dr. He Huang, Computer Science, University of Wyoming
Dr. John Dennis, National Center for Atmospheric Research
Dr. Po Chen, Geology and Geophysics, University of Wyoming
Mr. Galen Arnold, National Center for Supercomputing Applications
Dr. En-Jui Lee, Geology and Geophysics, University of Wyoming
 He Huang, Liqiang Wang, En-Jui Lee, and Po Chen. An MPI-CUDA Implementation and Optimization for Parallel Sparse Equations and Least Squares (LSQR). In the 2012 International Conference on Computational Science (ICCS) (main track). Omaha, NE. Procedia Computer Science, Elsevier, 2012. (paper, slides, code, example)
 He Huang, John M. Dennis, Liqiang Wang and Po Chen, A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Tomographic Problems: A Case Study in Seismic Tomography. In the 2013 International Conference on Computational Science (ICCS) (main track). Barcelona, Spain. (paper, slides)