Gene Cooperman

Affiliated Faculty, Electrical and Computer Engineering
Professor, Khoury College of Computer Sciences

Contact

Office

  • 617.373.8686

Research Focus

Computer systems, high-performance computing, transparent checkpoint-restart, model checking

About

Gene Cooperman is a professor at the Khoury College of Computer Sciences and an affiliated faculty member at the College of Engineering at Northeastern University. He has worked in a series of interdisciplinary research areas, including applied mathematics, computational and symbolic algebra, numerical analysis, computing in high energy physics, bioinformatics, high-performance computing, and computer systems.  He has graduated 10 PhD students and has co-authored over 125 refereed publications.

The ongoing DMTCP project (Distributed MultiThreaded Checkpointing) supports transparent checkpointing (snapshots) with no modification to the target application binary. DMTCP also extends transparent checkpointing to external hardware and software environments, such as GPUs and network interconnects, in order to support MPI for HPC. More than 150 refereed publications document uses of DMTCP.

The newest direction for DMTCP is to make it a standard for supercomputing and HPC. In collaboration with the DOE’s NERSC supercomputing center, the DMTCP project (including MANA for MPI and CRAC for CUDA) is being extended and validated for production use. The software will be used on NERSC’s Perlmutter supercomputer (expected to become the #6 supercomputer in the world when fully installed). Users there are currently limited to a maximum allocation time slot of 48 hours. With DMTCP, MANA, and CRAC, scientists will be able to run longer computations by checkpointing at the end of one allocation and restarting in the next, chaining multiple time slots together. As a showcase project, this work will also allow other HPC centers to adopt the technology.

Research Overview

I lead the High Performance Computing Laboratory within the Khoury College of Computer Sciences.  The lab currently includes four PhD students (one in ECE/COE and three in CS/Khoury).

Transparent Checkpointing for HPC (funded by NSF, NERSC/DOE, and MemVerge) — Our DMTCP platform has been extended to include MANA (checkpointing for MPI) and CRAC (checkpointing for CUDA).  We are working closely with NERSC to integrate MANA into their production workload.

Selected Publications

  • Twinkle Jain and Gene Cooperman, CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM, Proc. of the Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC’20)
  • Rohan Garg, Gregory Price, and Gene Cooperman, MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing, Proc. of 28th Int. Symp. on High Performance Parallel and Distributed Computing (HPDC’19), Phoenix, AZ, USA, ACM, pp. 49–60, June, 2019
  • Kapil Arya, Rohan Garg, Artem Y. Polyakov and Gene Cooperman, Design and Implementation for Checkpointing of Distributed Resources using Process-level Virtualization. Proc. of IEEE Int. Conf. on Cluster Computing (Cluster’16), pp. 402–412, Taipei, Taiwan, IEEE Press, Sept. 2016.
  • Sunil Ahn, John Apostolakis, Makoto Asai, Daniel Brandt, Gene Cooperman, Gabriele Cosmo, Xin Dong, Andrea Dotti, Andrzej Nowak, and S.J. Yun, Geant4: Bringing Multi-Threaded Geant4 into Production, Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo (SNA + MC 2013), article 04213, 8 pages, 2014
  • Jason Ansel, Kapil Arya and Gene Cooperman, DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop. 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS’09), 12 pages, Rome, Italy. May 2009

Faculty

Apr 04, 2012

FY13 TIER 1 Award Recipients

27 COE faculty and affiliates were recipients of FY13 TIER 1 Interdisciplinary Research Seed Grants for 21 different research projects.