Faster Neural Network Training, Algorithmically

As Chief Scientist at MosaicML, I develop ways to change the neural network training algorithm itself in order to make training more efficient. The cost of training state-of-the-art neural networks is increasing exponentially, and hardware and compiler improvements alone are insufficient to counterbalance this trend. Instead, I believe we need to fundamentally change the underlying training algorithms. Training is an approximate computing problem; there is nothing sacred about the math or training recipes we use today. This line of work leverages empirical analysis of the training dynamics of real-world networks to change the math behind training in ways that improve efficiency without affecting quality.

At MosaicML, we have developed dozens of speedup methods that improve the efficiency of training standard models for computer vision and natural language processing. All of these methods are available open-source in our Composer PyTorch trainer, and each is described in the Composer documentation. You can interactively explore the results of applying these speedup methods to standard training benchmarks in the MosaicML Explorer. Our best recipes speed up ResNet-50 on ImageNet by 7x, DeepLabv3 on ADE20K by 5x, BERT pre-training by 2x, and GPT language modeling by 2x while maintaining the same quality as the baselines.
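
As an illustration, here is a minimal sketch of how these speedup methods compose with the Composer trainer. The model, data, and choice of algorithms are placeholders, and constructor arguments can vary across Composer releases; the Composer documentation is the authoritative reference.

```python
# A minimal, hypothetical sketch of composing speedup methods in Composer.
# The model, data, and algorithm choices here are placeholders.
import torch
import torchvision
from torch.utils.data import DataLoader, TensorDataset

from composer import Trainer
from composer.algorithms import BlurPool, LabelSmoothing, ProgressiveResizing
from composer.models import ComposerClassifier

# Wrap a standard torchvision model so that Composer can train it.
model = ComposerClassifier(torchvision.models.resnet50(num_classes=10), num_classes=10)

# Stand-in data; in practice this would be ImageNet, ADE20K, a text corpus, etc.
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 10, (64,))
train_dataloader = DataLoader(TensorDataset(images, labels), batch_size=16)

# Speedup methods are passed to the trainer as composable "algorithms" that
# modify the training loop without changing the model definition.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="1ep",
    algorithms=[BlurPool(), LabelSmoothing(smoothing=0.1), ProgressiveResizing()],
)
trainer.fit()
```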

For an overview of our research approach and how we evaluate speedups at MosaicML, see my blog post on the subject.

The Lottery Ticket Hypothesis

My main line of research during my PhD was the lottery ticket hypothesis, which focuses on understanding how large neural networks need to be in order to train in practice. We have long known that neural networks can be made much smaller after they have been trained; in this line of work, I showed that they can be equally small for much or all of training. This research has revealed new insights into how neural networks learn and offered opportunities for practical efficiency improvements.
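
The core experimental procedure behind this work is iterative magnitude pruning with weight rewinding: train the network, prune the smallest-magnitude weights, rewind the surviving weights to their values from initialization or early in training, and retrain. The sketch below is a simplified illustration of one round of that loop in plain PyTorch with a placeholder model and training step; it is not the exact code from the papers (see OpenLTH, described below, for that).

```python
# A simplified sketch of one round of iterative magnitude pruning with
# rewinding. The model and the commented-out train() calls are placeholders.
import copy
import torch
import torch.nn as nn


def magnitude_prune(model: nn.Module, masks: dict, fraction: float) -> dict:
    """Prune the smallest-magnitude `fraction` of each layer's remaining weights."""
    new_masks = {}
    for name, param in model.named_parameters():
        if "weight" not in name:
            continue  # prune weight matrices only; leave biases alone
        mask = masks.get(name, torch.ones_like(param))
        remaining = param[mask.bool()].abs()
        k = int(fraction * remaining.numel())
        if k == 0:
            new_masks[name] = mask
            continue
        threshold = remaining.sort().values[k]
        new_masks[name] = mask * (param.abs() >= threshold).float()
    return new_masks


def apply_masks(model: nn.Module, masks: dict) -> None:
    """Zero out pruned weights in place."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])


# 1. Save the weights from initialization (or early in training) to rewind to.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
rewind_state = copy.deepcopy(model.state_dict())

# 2. Train to completion.
# train(model)  # placeholder for a standard training loop

# 3. Prune the lowest-magnitude 20% of the remaining weights in each layer.
masks = magnitude_prune(model, masks={}, fraction=0.2)

# 4. Rewind the surviving weights to their saved values, re-apply the masks,
#    and retrain; in the full procedure, pruned weights stay at zero throughout.
model.load_state_dict(rewind_state)
apply_masks(model, masks)
# train(model, masks)  # placeholder; repeat from step 2 for further rounds
```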

2021 On the Predictability of Pruning Across Scales (ICML)
Pruning Neural Networks at Initialization: Why are we missing the mark? (ICLR)
Studying the Consistency and Composability of Lottery Tickets (ICLR Workshop)
Reconciling Sparse and Structured Pruning: A Study of Block Sparsity (ICLR Workshop)
Examining the Role of Normalization in the Lottery Ticket Hypothesis (ICLR Workshop)
2020 The Lottery Ticket Hypothesis for Pre-Trained BERT Networks (NeurIPS)
Linear Mode Connectivity and the Lottery Ticket Hypothesis (ICML)
Comparing Fine-Tuning and Rewinding in Neural Network Pruning (ICLR Oral)
The Early Phase of Neural Network Training (ICLR)
What is the State of Neural Network Pruning? (MLSys)
2019 Stabilizing the Lottery Ticket Hypothesis / The LTH at Scale (arXiv)
The Lottery Ticket Hypothesis (ICLR Best Paper)

My open-source library for conducting research on the lottery ticket hypothesis is called OpenLTH. This is my current working codebase for this line of research. It is written for PyTorch, and it includes the components necessary to reproduce the main experiments from my work on the lottery ticket hypothesis. For an updated version of the codebase that supports experiments on pruning neural networks at initialization and early in training, see the supplemental materials accompanying Pruning Neural Networks at Initialization: Why are we missing the mark? on OpenReview.

Science of Deep Learning

More broadly, I am interested in understanding the behavior of practical neural networks empirically. For all the extraordinary advances neural networks have enabled in recent years, our understanding of how and what they learn remains limited. I study these questions from a scientific perspective, posing hypotheses and performing large-scale experiments to empirically evaluate them. I believe we can improve our knowledge of neural networks by scientifically examining how they behave in practice.

2022 What Can Linear Interpolation of Neural Network Loss Landscapes Tell Us? (ICML)
2021 Training BatchNorm and Only BatchNorm (ICLR)
2020 Revisiting "Qualitatively Characterizing Neural Network Loss Landscapes" (NeurIPS Workshop)
Trade-offs of Local SGD at Scale (NeurIPS Workshop)
Are All Negatives Created Equal in Contrastive Instance Discrimination? (arXiv)
2019 Dissecting Pruned Neural Networks (ICLR Workshop)

Technology Policy

During my year as Staff Technologist at the Center on Privacy and Technology at Georgetown Law, I studied police use of face recognition technology. I also collaborated with Prof. Paul Ohm on both scholarship and teaching. In the years since, I have served as an invited expert at the OECD, contributing to the OECD AI Principles and follow-up work.

In Progress Computer Programming for Lawyers (Course & Textbook)
2018 Desirable Inefficiency (Florida Law Review)
2016 The Perpetual Lineup: Unregulated Police Face Recognition in America (Investigative Report)
How Russia's New Facial Recognition App Could End Online Anonymity (The Atlantic)
Facial-Recognition Software Might Have a Racial Bias Problem (The Atlantic)

Earlier Research

During the early part of my PhD, I studied cryptography and technology policy. My research on using zero-knowledge arguments to facilitate novel tradeoffs between secrecy and accountability in the court system (Practical Accountability of Secret Processes) was published at USENIX Security 2018.

During my master's degree, I studied programming language theory with Prof. David Walker. My thesis work was on type-directed program synthesis, and I published an extended version of the thesis (Example-Directed Synthesis: A Type-Theoretic Interpretation) in POPL 2016.