Amir Gholami

About

Amir Gholami is a research scientist in the BAIR and Sky labs at UC Berkeley and co-director of the Pallas lab. He received his PhD from UT Austin, working on large-scale machine learning, work that received UT Austin's best doctoral dissertation award in 2018. He is a Melosh Medal finalist and the recipient of the Amazon Machine Learning Research Award in 2020, the best student paper award at SC'17, and the Gold Medal in the ACM Student Research Competition, and he was a best student paper finalist at SC'14. He was also part of the Nvidia team that first made low-precision (FP16) neural network training possible, enabling a more than 10x increase in compute power through tensor cores. Amir's current research focuses on large-scale agentic systems.

Contact Email: "amirgh _at_ berkeley . edu".

Open Positions:

There is an internship opportunity for research in the area of Efficient AI Agents (the position requires the student to be enrolled at UC Berkeley). If you are interested, please email me your CV and transcript with the subject line "Efficient AI Agents".

Publications

Papers

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks,
Erdogan LE, Lee N, Kim S, Moon S, Furuta H, Anumanchipalli G, Keutzer K, Gholami A.
ICML, 2025 (Accepted).

ETS: Efficient Tree Search for Inference-Time Scaling,
Hooper C, Kim S, Moon S, Dilmen K, Maheswaran M, Lee N, Mahoney MW, Shao S, Keutzer K, Gholami A.
under review, 2025.

QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache,
Tiwari R, Xi H, Tomar A, Hooper C, Kim S, Horton M, Najibi M, Mahoney MW, Keutzer K, Gholami A,
ICML, 2025 (Accepted).

Squeezed Attention: Accelerating Long Context Length LLM Inference,
Hooper C, Kim S, Mohammadzadeh H, Maheswaran M, Paik J, Mahoney MW, Keutzer K, Gholami A,
ACL, 2025 (to appear).

Characterizing prompt compression methods for long context inference,
Jha S, Erdogan LE, Kim S, Keutzer K, Gholami A,
Es-FoMo, ICML 2024.

Efficient and Scalable Estimation of Tool Representations in Vector Space,
Moon S, Jha S, Erdogan LE, Kim S, Lim W, Keutzer K, Gholami A,
under review, 2024.

TinyAgent: Function Calling at the Edge,
Erdogan LE, Lee N, Jha S, Kim S, Tabrizi R, Moon S, Hooper C, Anumanchipalli G, Keutzer K, Gholami A,
EMNLP, 2024.

Reliable edge machine learning hardware for scientific applications,
Tommaso Baldi et al.,
IEEE 42nd VLSI Test Symposium (VTS), 2024.

AI and Memory Wall,
A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, K. Keutzer,
IEEE Micro Journal, 2024.

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement,
N. Lee, T. Wattanawong, S. Kim, K. Mangalam, S. Shen, G. Anumanchipalli, M. W. Mahoney, K. Keutzer, and A. Gholami,
ACL, 2024.

Towards 10 Million Context Length LLM Inference with KV Cache Quantization,
C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, S. Shao, K. Keutzer, and A. Gholami,
NeurIPS, 2024.

An LLM compiler for parallel function calling,
S. Kim, S. Moon, R. Tabrizi, N. Lee, M. W. Mahoney, K. Keutzer, and A. Gholami,
ICML, 2024.

SPEED: Speculative pipelined execution for efficient decoding
C. Hooper, S. Kim, H. Mohammadzadeh, H. Genc, K. Keutzer, A. Gholami, and S. Shao,
ENLSP Workshop at NeurIPS, 2023.

SqueezeLLM: Dense-and-Sparse Quantization
S. Kim*, C. Hooper*, A. Gholami*, Z. Dong, X. Li, S. Shen, M. Mahoney, K. Keutzer,
ICML, 2024.

Towards Foundation Models for Scientific Machine Learning: Characterizing Scaling and Transfer Behavior
S. Subramanian, P. Harrington, K. Keutzer, W. Bhimji, D. Morozov, M. Mahoney, A. Gholami,
NeurIPS, 2023.

End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs
J. Campos, Z. Dong, J. Duarte, A. Gholami, M. Mahoney, J. Mitrevski, N. Tran,
Transactions on Reconfigurable Technology and Systems, 2024.

Speculative Decoding with Big Little Decoder
S. Kim*, K. Mangalam, J. Malik, M. Mahoney, A. Gholami, and K. Keutzer,
NeurIPS, 2023.

Full Stack Optimization of Transformer Inference: a Survey
S. Kim*, C. Hooper*, T. Wattanawong, M. Kang, R. Yan, H. Genc, G. Dinh, Q. Huang, K. Keutzer, M. Mahoney, Y. Shao, and A. Gholami,
ASSYST Workshop, ISCA, 2023.

Adaptive Self-supervision Algorithms for Physics-informed Neural Networks
S. Subramanian, R. Kirby, M. Mahoney, A. Gholami,
ECAI, 2023.

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
S. Kim*, A. Gholami*, A. Shaw, N. Lee, K. Mangalam, J. Malik, M. W. Mahoney, and K. Keutzer,
NeurIPS, 2022.

A Fast Post-Training Pruning Framework for Transformers
W. Kwon, S. Kim, M. W. Mahoney, J. Hassoun, K. Keutzer, and A. Gholami,
NeurIPS, 2022.

Applications and Techniques for Fast Machine Learning in Science
Frontiers in Big Data, 2022.

Characterizing possible failure modes in physics-informed neural networks
A. Krishnapriyan*, A. Gholami*, S. Zhe, R. Kirby, M. Mahoney,
NeurIPS, 2021.

Learned Token Pruning for Transformers
S. Kim*, S. Sheng*, D. Thorsley*, A. Gholami*, W. Kwon, J. Hassoun, K. Keutzer,
KDD, 2022.

Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition
S. Kim, A. Gholami, Z. Yao, A. Nrusimha, B. Zhai, T. Gao, M. W. Mahoney, K. Keutzer,
ICASSP, 2022.

A Survey of Quantization Methods for Efficient Neural Network Inference
A. Gholami*, S. Kim*, Z. Dong*, Z. Yao*, M. W. Mahoney, K. Keutzer,
Book Chapter: Low-Power Computer Vision: Improving the Efficiency of Artificial Intelligence, 2021.

Hessian-Aware Pruning and Optimal Neural Implant
S. Yu*, A. Gholami*, Z. Yao*, Z. Dong*, M. W. Mahoney, K. Keutzer,
WACV, 2022.

I-BERT: Integer-only BERT Quantization
S. Kim*, A. Gholami*, Z. Yao*, M. W. Mahoney, K. Keutzer,
ICML, 2021 (Long talk).

HAWQ-V3: Dyadic Neural Network Quantization
Z. Yao*, Z. Dong*, Z. Zheng*, A. Gholami*, E. Tan, J. Li, L. Yuan, Q. Huang, Y. Wang, M. W. Mahoney, K. Keutzer,
ICML, 2021.

ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
Z. Yao*, A. Gholami*, S. Shen, M. Mustafa, M. Mahoney, K. Keutzer,
AAAI, 2021.

Boundary thickness and robustness in learning models
Y. Yang, R. Khanna, Y. Yu, A. Gholami, K. Keutzer, J. Gonzalez, K. Ramchandran, M. Mahoney,
NeurIPS, 2020.

PowerNorm: Rethinking Batch Normalization in Transformers
S. Shen, Z. Yao, A. Gholami, M. Mahoney, K. Keutzer,
ICML, 2020.

ZeroQ: A Novel Zero Shot Quantization Framework,
Y. Cai, Z. Yao, Z. Dong, A. Gholami, M. Mahoney, K. Keutzer,
CVPR, 2020. [Code]

PyHessian: Neural Networks Through the Lens of the Hessian,
Z. Yao*, A. Gholami*, K. Keutzer, M. Mahoney,
IEEE BigData (Oral Presentation), 2020.
(also a Spotlight at ICML workshop on Beyond First-Order Optimization Methods in Machine Learning) [Code]

HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks,
Z. Dong, Z. Yao, D. Arfeen, A. Gholami, M. Mahoney, K. Keutzer,
NeurIPS, 2020.

Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization,
P. Jain, A. Jain, A. Nrusimha, A. Gholami, P. Abbeel, K. Keutzer, I. Stoica, J. Gonzalez,
MLSys, 2020. [Code]

Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT,
S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. Mahoney, K. Keutzer,
AAAI, 2020.

ANODEV2: A Coupled Neural ODE Evolution Framework,
T. Zhang*, Z. Yao*, A. Gholami*, K. Keutzer, J. Gonzalez, G. Biros, M. Mahoney,
NeurIPS, 2019.

HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision,
Z. Dong*, Z. Yao*, A. Gholami*, K. Keutzer, M. Mahoney,
ICCV, 2019.

Inefficiency of K-FAC for Large Batch Size Training,
L. Ma, G. Montague, J. Ye, Z. Yao, A. Gholami, K. Keutzer, M. W. Mahoney,
AAAI, 2020.

ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs,
A. Gholami, K. Keutzer, G. Biros,
IJCAI, 2019.

Trust region based adversarial attack on neural networks,
Z. Yao, A. Gholami, P. Xu, K. Keutzer, and M. Mahoney,
CVPR, 2019.

A Novel Domain Adaptation Framework for Medical Image Segmentation,
A. Gholami, S. Subramanian, V. Shenoy, N. Himthani, X. Yue, S. Zhao, P. Jin, G. Biros, K. Keutzer,
Lecture Notes in Computer Science (LNCS), Springer, 2018.

Simulation of glioblastoma growth using a 3D multispecies tumor model with mass effect,
S. Subramanian, A. Gholami, G. Biros,
Journal of Mathematical Biology (JMatBio), 2019.

Large Batch Size Training of Neural Networks with Adversarial Training and Second-Order Information,
Z. Yao*, A. Gholami*, D. Arfeen, R. Liaw, J. Gonzalez, K. Keutzer, M. Mahoney,
arxiv preprint, arxiv:1810.01021, 2018.

On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent,
N. Golmant, N. Vemuri, Z. Yao, V. Feinberg, A. Gholami, K. Rothauge, M. Mahoney, J. Gonzalez,
arxiv preprint, arxiv:1811.12941, 2018.

Co-Design of Deep Neural Nets and Neural Net Accelerators for Embedded Vision Applications,
K. Kwon, A. Amid, A. Gholami, B. Wu, K. Keutzer,
Design Automation Conference (DAC), 2018.

Hessian-based Analysis of Large Batch Training and Robustness to Adversaries,
Z. Yao*, A. Gholami*, Q. Lei, K. Keutzer, M. Mahoney,
NeurIPS, 2018.

Integrated Model, Batch and Domain Parallelism in Training Neural Networks,
A. Gholami, A. Azad, P. Jin, K. Keutzer, A. Buluc,
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2018.

Shift: A zero flop, zero parameter alternative to spatial convolutions,
B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholami, J. Gonzalez, K. Keutzer,
CVPR (Spotlight talk), 2018.

Personalized Emotion Recognition by Personality-aware High-order Learning of Physiological Signals,
S. Zhao, A. Gholami, G. Ding, J. Han, K. Keutzer,
ACM Transactions on MultiMedia Computing, 2018.

CLAIRE: A distributed-memory solver for constrained large deformation diffeomorphic image registration,
A. Mang, A. Gholami, C. Davatzikos, G. Biros,
SIAM Journal on Scientific Computing, 2019.

Coupling brain-tumor biophysical models and diffeomorphic image registration,
K. Scheufele, A. Mang, A. Gholami, C. Davatzikos, G. Biros, and M. Mehl,
Computer Methods in Applied Mechanics and Engineering, 2019.

SIBIA-GlS: Scalable biophysics-based image analysis for glioma segmentation,
A. Mang, S. Tharakan, A. Gholami, N. Himthani, S. Subramanian, J. Levitt, M. Azmat, K. Scheufele, M. Mehl, C. Davatzikos, B. Barth, and G. Biros,
The multimodal brain tumor image segmentation benchmark (BRATS), MICCAI, 2017.

A framework for scalable biophysics-based image analysis,
A. Gholami, A. Mang, K. Scheufele, C. Davatzikos, M. Mehl, and G. Biros,
Proceedings of ACM/IEEE SuperComputing Conference (SC), 2017.

PDE constrained optimization in medical image analysis,
A. Mang, A. Gholami, C. Davatzikos, and G. Biros,
Optimization and Engineering, 2017.

Distributed-memory large-deformation diffeomorphic 3D image registration,
A. Mang, A. Gholami, and G. Biros,
Proceedings of ACM/IEEE SuperComputing Conference (SC), 2016.

AccFFT: A library for distributed-memory FFT on CPU and GPU architectures,
A. Gholami, J. Hill, D. Malhotra, and G. Biros,
arxiv preprint, arxiv:1506.07933, 2018.

A volume integral equation Stokes solver for problems with variable coefficients,
D. Malhotra, A. Gholami, and G. Biros,
Proceedings of ACM/IEEE SuperComputing Conference (SC), 2014 (Best Student Paper Finalist).

An inverse problem formulation for parameter estimation of a reaction–diffusion model of low grade gliomas,
A. Gholami, A. Mang, and G. Biros,
Journal of Mathematical Biology, Vol. 72, pp. 409-433, 2015.

FFT, FMM, or Multigrid? A comparative Study of State-Of-the-Art Poisson Solvers for Uniform and Nonuniform Grids in the Unit Cube,
A. Gholami, D. Malhotra, H. Sundar, and G. Biros,
SIAM Journal on Scientific Computing, Vol. 38 (3), 2016.

Workshops

Trace weighted Hessian-aware quantization,
Z. Dong, Z. Yao, D. Arfeen, Y. Cai, A. Gholami, M. Mahoney, and K. Keutzer,
Spotlight at NeurIPS'19 workshop on Beyond First-Order Optimization Methods in Machine Learning, 2019.

Parameter re-initialization through cyclical batch-scheduling,
N. Mu, Z. Yao, A. Gholami, K. Keutzer, and M. Mahoney,
SysML Workshop at NeurIPS, 2018.

SqueezeNext: Hardware-Aware Neural Network Design,
A. Gholami, K. Kwon, B. Wu, Z. Tai, X. Yue, P. Jin, S. Zhao, K. Keutzer,
ECV Workshop at CVPR, 2018.

Communication analysis of hybrid model and data parallelism in training neural networks,
Amir Gholami, Ariful Azad, Kurt Keutzer, and Aydin Buluc,
Deep Learning at Supercomputer Scale, NeurIPS, 2017.

Selected Talks

BAIR Robotics and Systems, Apr. 2022,
When It's Time You Had a Little Talk with Your Robot.

Monterey Data Workshop, Apr. 2022,
Rethinking Physics Informed Neural Networks.

Nvidia GTC Conference, Apr. 2021,
Systematic Neural Network Quantization.

Opening Keynote, Intel System Architecture Summit (ISAS), Feb. 2021,
Emerging AI Applications: Moving Beyond ResNet50 on ImageNet.

Google Research, Dec. 2020,
Systematic Quantization and Pruning for Efficient Neural Network Inference.

Keynote in NSF Cyberinfrastructure workshop, Feb. 2020,
An Integrated Approach for Efficient Neural Network Design, Training, and Inference.

UC Berkeley, RiseLab Retreat, Jan. 2020,
HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks.

UC Berkeley, BLISS Seminar, Oct. 2019,
Systematic Quantization of Neural Networks Through Second-Order Information.

Facebook, AI Systems Faculty Summit, Sep. 2019,
Efficient Neural Networks through Systematic Quantization.

BSTARS'19, Berkeley Statistics Department, Mar. 2019,
Neural Networks Through the Lens of the Hessian.

Berkeley Simons Institute, 5th Annual Industry Day, Feb. 2019,
ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs.

Simons Randomized Numerical Linear Algebra and Applications Workshop, Sep. 2018,
Large Scale Stochastic Training of Neural Networks.

Simons Data Science Finale Workshop, Dec. 2018,
Towards Robust Second-order Training of Neural Networks.

Simons Weekly Optimization Reading Group, Oct. 2018,
Second order optimization for convex and non-convex problems.

NERSC Data Seminar, Dec. 2018,
Beyond SGD: Robust Optimization and Second-Order Information for Large-Scale Training of Neural Networks.

Stanford, CME 510: Linear Algebra and Optimization Seminar, Nov. 2018,
Large-scale training of Neural Networks.

UCSF Radiology Department, Oct. 2018,
A Domain Adaptation framework for Neural Network Based Medical Image Segmentation.

Intel AI Meeting, Oct. 2018,
Autonomous Driving Challenges in Computer Vision Research.

Facebook AI Research, Sep. 2018,
Challenges for Distributed Training of Neural Networks.

Microsoft Research, Aug. 2018,
Large Scale Training of Neural Networks.

Berkeley Scientific Computing and Matrix Computations Seminar, Sep. 2017,
A Framework for Scalable Biophysics-based Image Analysis.

Stanford, ICME Star Talk Series, 2017,
Fast algorithms for inverse problems with parabolic PDE constraints with application to biophysics-based image analysis.

SIAM Minisymposium on Imaging Sciences, Albuquerque, NM, USA, 2016,
On preconditioning Newton method for PDE constrained optimization problems.

13th U.S. National Congress on Computational Mechanics, San Diego, CA, USA, 2015,
Challenges for exascale scalability of elliptic solvers using a model Poisson solver and comparing state-of-the-art methods.

SIAM CSE Minisymposium, Salt Lake, Utah, USA, 2015,
Parameter estimation for malignant brain tumors.

12th U.S. National Congress on Computational Mechanics, Raleigh, NC, USA, 2013,
A numerical algorithm for biophysically-constrained parameter estimation for tumor modeling and data assimilation with medical images.

SIAM Annual Meeting, San Diego, CA, USA, 2013,
Image-driven inverse problem for estimating initial distribution of brain tumor modeled by advection-diffusion-reaction equation.

Patents

Dynamic directional rounding,
A. Fit-Florea, A. Gholami, B. Ginsburg, and P. Davoodi.
Approved by Nvidia Patent Office (US patent pending), 2018.

Tensor processing using low precision format,
B. Ginsburg, S. Nikolaev, A. Kiswani, H. Wu, A. Gholami, S. Kierat, M. Houston, and A. Fit-Florea.
United States patent application US 15/624,577. 2017 Dec 28.

High performance inplace transpose operations,
A. Gholami and B. Natarajan,
United States patent US 10,067,911, 2018.
