Amir Gholami is a postdoctoral research fellow in the BAIR Lab. He received his PhD from UT Austin, working on biophysics-based image analysis, a research topic which received UT Austin's best doctoral dissertation award in 2018 (thesis can be found here). He is a Melosh Medal finalist and a recipient of the best student paper award at SC'17 and a Gold Medal in the ACM Student Research Competition, as well as a best student paper finalist at SC'14. His current research includes large-scale training of neural networks, stochastic second-order methods, and robust optimization (resume).

Contact Email: "amirgh _at_ eecs . berkeley . edu".

Recent News

  • 03/01/19: Our Trust Region paper has been accepted to CVPR'19!
  • 02/28/19: Will be giving a talk at the Fifth Annual Industry Day at the Simons Institute
  • 11/06/18: Three papers accepted in NeurIPS'18 (one main conference and two workshops)
  • 11/01/18: I will be giving a talk in the Stanford CME-510 lecture series
  • 03/30/18: Just learned that my PhD thesis has won UT Austin's 2018 Outstanding Dissertation Award. Thanks George for your great mentorship
  • 03/28/18: We have released SqueezeNext, the smallest neural network designed so far (112x smaller than AlexNet)
  • 03/05/18: Bichen's paper was selected for a spotlight at CVPR'18
  • 02/26/18: Selected as a finalist for the Robert J. Melosh Medal. Very excited to give the Melosh Medal talk at Duke University
  • 02/08/18: Will be giving a lecture in CS267 on GPUs [Watch Here]
  • 11/21/17: Our paper won the Best Student Paper award at SC'17!



In this work, we introduce SqueezeNext, a new family of neural network architectures. SqueezeNext matches AlexNet's accuracy on ImageNet with 112x fewer parameters, and its deeper variant exceeds VGG-19's accuracy with only 4.4 million parameters (31x smaller). SqueezeNext also achieves better top-5 classification accuracy with 1.3x fewer parameters than MobileNet, while avoiding depthwise-separable convolutions, which have poor arithmetic intensity. Hardware simulation results for power and inference speed on an embedded system guided us in optimizing the baseline model, yielding variants that are 2.59x/8.26x faster and 2.25x/7.5x more energy efficient than SqueezeNet/AlexNet without any accuracy degradation. For details please see this paper.

Landscape of Neural Network Loss

Characterizing the generalization performance of neural networks at different points in the optimization space is an active area of research. In particular, a network's performance depends strongly on the mini-batch size used for training. But what differs in the quality of the solution for large and small batch sizes that leads to this gap? We study this through the lens of the Hessian operator and show an interesting connection between the robustness of the neural network and the mini-batch size. For details please see this paper.

Multi-Modal Brain Segmentation

Segmenting a tumor-bearing image is the task of decomposing the image into disjoint regions. We present a framework for fully automatic segmentation of brain MRI bearing gliomas, which includes three main steps: (1) preprocessing of the input MRI to normalize intensities and transport them into a common atlas space; (2) using supervised machine learning to create initial segmentation and probability maps for the target classes (whole tumor, edema, tumor core, and enhancing tumor); (3) combining these probabilities with an atlas-based segmentation algorithm in which we use a tumor growth model to improve on the segmentation and probability maps from the supervised learning scheme. The result of this work will be presented at MICCAI 2017.


Half Precision Training

I worked on this project during my internship at NVIDIA. The goal was to perform the whole training pipeline using half-float precision. This is very challenging due to the limited range of expressible numerical values in half precision. The limited precision severely aggravates the vanishing and exploding gradient problems in neural networks. Existing approaches included the use of stochastic rounding, which cannot achieve the baseline accuracy even for shallow networks. We developed a novel approach that achieves the same accuracy as the baseline, with all the calculations and storage in half-float. We successfully tested the method on deep networks such as AlexNet and GoogLeNet. This work has resulted in a pending patent application.
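To make the range limitation concrete, here is a small NumPy sketch. It only illustrates the problem (overflow, underflow, and coarse spacing in IEEE half precision); it is not the method we developed, and the sample values are chosen purely for demonstration.

```python
import numpy as np

# Representable range of IEEE half precision (float16).
info = np.finfo(np.float16)
largest = float(info.max)            # largest finite value
smallest_normal = float(info.tiny)   # smallest positive normal value

with np.errstate(over="ignore", under="ignore"):
    # A tiny gradient-like value falls below the smallest subnormal
    # and is flushed to zero (a "vanishing" gradient).
    small_grad = np.float16(1e-8)
    # A large activation-like value exceeds the maximum and becomes inf
    # (an "exploding" value).
    large_act = np.float16(1e5)

# Spacing between neighbouring float16 values near 2048 is 2,
# so adding 1 is lost to rounding.
lost_update = np.float16(2048) + np.float16(1)

print(largest, smallest_normal)
print(small_grad, large_act, lost_update)
```

The last line is the kind of effect that matters for weight updates: a small update added to a much larger weight can disappear entirely.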

Parallel Image Registration

Image registration is a process in which a mapping from a reference image to a target image is sought. It is key in many different applications ranging from medical imaging to machine learning. We have developed a state-of-the-art parallel registration solver that has been scaled up to 8,192 cores, and has been able to solve a record 3D image registration problem with 200 billion unknowns in less than 4 minutes. The code that we have developed is based on AccFFT along with a novel parallel high-order interpolation kernel. The result of this work will appear in SC'17 (best student paper finalist [pdf]).

Accelerated FFT Library

Accelerated FFT (AccFFT) is a new parallel FFT library for computing distributed Fast Fourier Transforms on GPU and CPU architectures. The library has been designed with the goal of achieving maximum performance without making the user interface complicated. AccFFT supports parallel FFTs distributed with slab or pencil decomposition for both CPU and GPU architectures. The library's scalability has been tested up to 131K CPU cores and up to 4K GPUs [pdf].
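The idea behind the two decompositions can be sketched on a single process with NumPy. This is not AccFFT's API, only an illustration of how a 3D FFT factors into lower-dimensional transforms: with slabs, each process would own full planes and do a local 2D FFT before a 1D FFT along the split axis; with pencils, each process would own full lines and the transform becomes three 1D FFTs. In a distributed code, the data would be redistributed between these stages.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8, 8))

# Slab view: a 2D FFT over the (y, z) planes, then a 1D FFT along x.
slab = np.fft.fft(np.fft.fft2(a, axes=(1, 2)), axis=0)

# Pencil view: three successive 1D FFTs, one axis at a time.
pencil = a.astype(complex)
for axis in (2, 1, 0):
    pencil = np.fft.fft(pencil, axis=axis)

# Both factorizations agree with the direct 3D transform.
reference = np.fft.fftn(a)
print(np.allclose(slab, reference), np.allclose(pencil, reference))
```

Splitting along two axes rather than one is what lets a pencil decomposition use more processes than there are planes in the grid.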

Novel Stokes Solver using FMM

The Stokes equation is one of the most important equations derived from Navier-Stokes. Numerical solution and discretization of the Stokes equation are challenging. For instance, one cannot use arbitrary discretization spaces for velocity and pressure. Moreover, it is an elliptic but indefinite problem, which further complicates the construction of fast linear algebraic solvers and preconditioners, especially for problems with highly variable coefficients or high-order discretizations. We are using a novel adaptive fast multipole method (pvfmm), which uses an integral formulation scheme that can circumvent most of the difficulties with the Stokes equation. Compared to finite element methods, our formulation decouples the velocity and pressure, and generates fields that are divergence free by construction [pdf].


Massively Parallel Poisson Solvers

The need for large-scale parallel solvers for elliptic partial differential equations (PDEs) pervades a spectrum of problems with resolution requirements that cannot be accommodated on current systems. Poisson solvers must scale to trillions of unknowns. Examples of methods that scale well are the FFT (based on spectral discretizations), the Fast Multipole Method, and multigrid methods (for stencil-based discretizations). We have benchmarked these methods and compared their parallel efficiency as well as the corresponding cost per unknown for different test cases. FFT is tested with p3dfft, FMM with pvfmm, AMG with the ML package, and GMG with an in-house code [pdf].
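As a rough illustration of the FFT approach, the following single-process NumPy sketch solves -Δu = f on the periodic unit cube with a spectral discretization. It is a toy under stated assumptions (periodic boundary conditions, a uniform grid, and a manufactured right-hand side), not the benchmarked distributed solvers; the function name `solve_poisson_periodic` is hypothetical.

```python
import numpy as np

def solve_poisson_periodic(f):
    """Solve -Laplace(u) = f on the periodic unit cube (zero-mean solution)."""
    n = f.shape[0]
    # Angular wavenumbers 2*pi*m for integer modes m.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / n)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0              # avoid dividing by zero for the mean mode
    u_hat = np.fft.fftn(f) / k2    # in Fourier space, -Laplace becomes |k|^2
    u_hat[0, 0, 0] = 0.0           # fix the free constant: zero-mean solution
    return np.real(np.fft.ifftn(u_hat))

# Manufactured test: u = sin(2 pi x) sin(2 pi y) sin(2 pi z), f = 12 pi^2 u.
n = 16
x = np.arange(n) / n
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
u_exact = np.sin(2 * np.pi * X) * np.sin(2 * np.pi * Y) * np.sin(2 * np.pi * Z)
f = 12.0 * np.pi**2 * u_exact

u = solve_poisson_periodic(f)
err = np.max(np.abs(u - u_exact))
print(err)
```

The cost is dominated by the forward and inverse transforms, which is why the scalability of the parallel FFT largely determines the scalability of this class of solver.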

Brain Tumor Inverse Problem

Gliomas are tumors that arise from glial cells in the brain. They account for 29% of all brain and central nervous system (CNS) tumors, and 80% of all malignant tumors, out of about 60,000 cases diagnosed each year in the United States. Despite advances in surgery and chemo/radiotherapy, the median survival for high-grade gliomas has remained about one year over the past 30 years. One of the key factors in increasing the survival rate of patients is how well the tumor invasion boundaries can be detected. With current imaging technologies only the bulk of the tumor abnormalities can be detected, and the infiltrated tumor cells are masked. I am trying to approximate the extent of tumor infiltration by coupling the imaging data with tumor growth dynamics [pdf].



  • L. Ma, G. Montague, J. Ye, Z. Yao, A. Gholami, K. Keutzer, M. Mahoney Inefficiency of KFAC for Large Batch Size Training, arXiv:1903.06237 [pdf].

  • A. Gholami, K. Keutzer, G. Biros ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs, arXiv:1902.10298 [pdf].

  • Z. Yao, A. Gholami, P. Xu, K. Keutzer, and M. Mahoney Trust region based adversarial attack on neural networks, arXiv:1812.06371 (Accepted in CVPR'19) [pdf].

  • A. Gholami, S. Subramanian, V. Shenoy, N. Himthani, X. Yue, S. Zhao, P. Jin, G. Biros, K. Keutzer A Novel Domain Adaptation Framework for Medical Image Segmentation, Lecture Notes in Computer Science (LNCS), Springer (arXiv:1810.05732) [pdf].

  • S. Subramanian, A. Gholami, G. Biros Simulation of glioblastoma growth using a 3D multispecies tumor model with mass effect, arXiv:1810.05370 [pdf].

  • Z. Yao, A. Gholami, K. Keutzer, M. Mahoney. Large Batch Size Training of Neural Networks with Adversarial Training and Second-Order Information, arXiv:1810.01021 [pdf].

  • N. Golmant, N. Vemuri, Z. Yao, V. Feinberg, A. Gholami, K. Rothauge, M. Mahoney, J. Gonzalez On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent, arXiv:1811.12941 (under review) [pdf].

  • K. Kwon, A. Amid, A. Gholami, B. Wu, K. Keutzer Co-Design of Deep Neural Nets and Neural Net Accelerators for Embedded Vision Applications, Design Automation Conference (DAC) 2018 [pdf].

  • Z. Yao, A. Gholami, Q. Lei, K. Keutzer, M. Mahoney. Hessian-based Analysis of Large Batch Training and Robustness to Adversaries, arXiv:1802.08241 (accepted in NeurIPS'18) [pdf].

  • A. Gholami, A. Azad, P. Jin, K. Keutzer, A. Buluc. Integrated Model, Batch and Domain Parallelism in Training Neural Networks, ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'18) [pdf].

  • B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholami, J. Gonzalez, K. Keutzer. Shift: A zero flop, zero parameter alternative to spatial convolutions, CVPR 2018 (Spotlight talk) [pdf].

  • S. Zhao, A. Gholami, G. Ding, J. Han, K. Keutzer. Personalized Emotion Recognition by Personality-aware High-order Learning of Physiological Signals, ACM Transactions on MultiMedia Computing (Accepted).

  • A. Mang, A. Gholami, C. Davatzikos, G. Biros CLAIRE: A distributed-memory solver for constrained large deformation diffeomorphic image registration (in review) [pdf].

  • K. Scheufele, A. Mang, A. Gholami, C. Davatzikos, G. Biros, and M. Mehl. Coupling brain-tumor biophysical models and diffeomorphic image registration, Computer Methods in Applied Mechanics and Engineering, 2019 [pdf].

  • A. Mang, S. Tharakan, A. Gholami, N. Himthani, S. Subramanian, J. Levitt, M. Azmat, K. Scheufele, M. Mehl, C. Davatzikos, B. Barth, and G. Biros. SIBIA-GlS: Scalable biophysics-based image analysis for glioma segmentation. The multimodal brain tumor image segmentation benchmark (BRATS), MICCAI, 2017 [pdf].

  • A. Gholami, A. Mang, K. Scheufele, C. Davatzikos, M. Mehl, and G. Biros. A framework for scalable biophysics-based image analysis. Proceedings of ACM/IEEE SuperComputing Conference (SC'17), 2017 (Best Student Paper) [pdf].

  • A. Mang, A. Gholami, C. Davatzikos, and G. Biros. PDE constrained optimization in medical image analysis. Optimization and Engineering (accepted), 2017.

  • A. Mang, A. Gholami, and G. Biros. Distributed-memory large-deformation diffeomorphic 3D image registration. Proceedings of ACM/IEEE SuperComputing Conference (SC16), 2016 [pdf].

  • A. Gholami, J. Hill, D. Malhotra, and G. Biros. AccFFT: A library for distributed-memory FFT on CPU and GPU architectures. (submitted), 2015 [pdf].

  • D. Malhotra, A. Gholami, and G. Biros. A volume integral equation Stokes solver for problems with variable coefficients. Proceedings of ACM/IEEE SuperComputing Conference (SC14), 2014 (Best Student Paper Finalist) [pdf].

  • A. Gholami, A. Mang, and G. Biros. An inverse problem formulation for parameter estimation of a reaction–diffusion model of low grade gliomas. Journal of Mathematical Biology, Vol. 72, pp. 409-433, 2015 [pdf].

  • A. Gholami, D. Malhotra, H. Sundar and G. Biros. FFT, FMM, or Multigrid? A comparative Study of State-Of-the-Art Poisson Solvers for Uniform and Nonuniform Grids in the Unit Cube. SIAM Journal on Scientific Computing, Vol. 38 (3), 2016 [pdf].


  • N. Mu, Z. Yao, A. Gholami, K. Keutzer, and M. Mahoney Parameter re-initialization through cyclical batch-scheduling, SysML Workshop at NeurIPS'18 [pdf].

  • A. Gholami, K. Kwon, B. Wu, Z. Tai, X. Yue, P. Jin, S. Zhao, K. Keutzer SqueezeNext: Hardware-Aware Neural Network Design, ECV Workshop at CVPR'18 [pdf].

  • A. Gholami, A. Azad, K. Keutzer, and A. Buluc. Communication analysis of hybrid model and data parallelism in training neural networks, Deep Learning at Supercomputer Scale, NIPS, 2017.


  • A. Gholami. Fast algorithms for inverse problems with parabolic PDE constraints with application to biophysics-based image analysis. Stanford, ICME Star Talk Series, 2017.

  • A. Gholami and G. Biros. On preconditioning Newton method for PDE constrained optimization problems. Minisymposium at SIAM Conference on Imaging Sciences, Albuquerque, NM, USA, 2016.

  • A. Gholami and G. Biros. Challenges for exascale scalability of elliptic solvers using a model Poisson solver and comparing state-of-the-art methods. 13th U.S. National Congress on Computational Mechanics, San Diego, CA, USA, 2015.

  • A. Gholami and G. Biros. Parameter estimation for malignant brain tumors. Minisymposium at SIAM CSE, Salt Lake, Utah, USA, 2015.

  • A. Gholami and G. Biros. A numerical algorithm for biophysically-constrained parameter estimation for tumor modeling and data assimilation with medical images. 12th U.S. National Congress on Computational Mechanics, Raleigh, NC, USA, 2013.

  • A. Gholami and G. Biros. Image-driven inverse problem for estimating initial distribution of brain tumor modeled by advection-diffusion-reaction equation. SIAM Annual Meeting, San Diego, CA, USA, 2013.


  • A. Gholami and G. Biros. AccFFT: A New Parallel FFT Library for CPU and GPU Architectures. Poster at ACM/IEEE SuperComputing Conference (SC15), Austin, TX, 2015.

  • A. Gholami and G. Biros. Inverse problem method for parameter estimation of a reaction-diffusion model of low grade gliomas. Poster at 13th U.S. National Congress on Computational Mechanics, San Diego, CA, USA, 2015.

  • A. Gholami and G. Biros. A numerical algorithm for biophysically-constrained parameter estimation for tumor modeling and data assimilation with medical images. Poster at 12th U.S. National Congress on Computational Mechanics, Raleigh, NC, USA, 2013.

  • A. Gholami and G. Biros. Image-driven inverse algorithms for brain tumor modeling and diagnosis. ASME Congress and Exposition, IMECE2012, Houston, USA, 2012.

  • A. Gholami and G. Biros. Fast algorithms for inverse problems of reaction-diffusion-advection equations. SIAM Annual Meeting, Minneapolis, USA, 2012.


  • B. Ginsburg, S. Nikolaev, A. Kiswani, H. Wu, A. Gholami, S. Kierat, M. Houston, and A. Fit-Flores. Tensor processing using low precision format, US Patent Pending, 2017.

  • A. Gholami and B. Natarajan. A novel high performance inplace transpose algorithm. US Patent Pending, 2017.

  • A. Gholami, R. Hosseini, M. Nabil, and M. H. Samadinia. Pool boiling cooling system. Iran Industrial Property Office, 68033, 2010.
Copyright © Amir Gholami 2014-2018