|Faster Neural Network Training, Algorithmically
In my current capacity as Chief Scientist at MosaicML, I am developing ways to change the neural network training algorithm to improve the efficiency of training. The cost of training state-of-the-art neural networks is increasing exponentially, and hardware and compiler improvements alone are insufficient to counterbalance this trend. Instead, I believe we need to fundamentally change the underlying training algorithms. Training is an approximate computing problem; there is nothing sacred about the math or training recipes we use today. This line of work leverages empirical analysis of the training dynamics of real-world networks to change the math behind training in ways that improve efficiency without affecting quality.
At MosaicML, we have developed dozens of speedup methods that improve the efficiency of training standard models for computer vision and natural language processing. All of these methods are available open-source in our Composer PyTorch trainer. We have written a description of each speedup method in the Composer documentation. You can interactively explore the results of applying these speedup methods to training standard benchmarks in the MosaicML Explorer. Our best recipes speed up ResNet-50 on ImageNet by 7x, DeepLabv3 on ADE20K by 5x, BERT Pre-Training by 2x, and GPT Language Modeling by 2x while maintaining the same quality as the baselines.
For an overview of our research approach and how we evaluate speedups at MosaicML, you can see my blog post on the subject.
|The Lottery Ticket Hypothesis
My main line of research during my PhD was on my lottery ticket hypothesis. This line of research focuses on understanding how large neural networks need to be to train in practice. We have long known that we can make neural networks much smaller after they have been trained. In this line of work, I showed that they can be equally small for much or all of training. This research has revealed new insights into how neural networks learn and offered opportunities for practical efficiency improvements.
|On the Predictability of Pruning Across Scales
|Pruning Neural Networks at Initialization: Why are we missing the mark?
|Studying the Consistency and Composability of Lottery Tickets
|Reconciling Sparse and Structured Pruning: A Study of Block Sparsity
|Examining the Role of Normalization in the Lottery Ticket Hypothesis
|The Lottery Ticket Hypothesis for Pre-Trained BERT Networks
|Linear Mode Connectivity and the Lottery Ticket Hypothesis
|Comparing Fine-Tuning and Rewinding in Neural Network Pruning
|The Early Phase of Neural Network Training
|What is the State of Neural Network Pruning?
|Stabilizing the Lottery Ticket Hypothesis/The LTH at Scale
|The Lottery Ticket Hypothesis
|ICLR Best Paper
My open-source library for conducting research on the lottery ticket hypothesis is called OpenLTH. This is my current working codebase for this line of research. It is written for PyTorch, and it includes the components necessary to reproduce the main experiments from my work on the lottery ticket hypothesis. For an updated version of the codebase that supports experiments on pruning neural networks at initialization and early in training, see the supplemental materials accompanying Pruning Neural Networks at Initialization: Why are we missing the mark? on OpenReview.
|Science of Deep Learning
More broadly, I am interested in understanding the behavior of practical neural networks empirically. For all the extraordinary advances neural networks have enabled in recent years, our understanding of how and what they learn remains limited. I study these questions from a scientific perspective, posing hypotheses and performing large-scale experiments to empirically evaluate them. I believe we can improve our knowledge of neural networks by scientifically examining how they behave in practice.
|What Can Linear Interpolation of Neural Network Loss Landscapes Tell Us?
|Training BatchNorm and Only BatchNorm
|Revisiting "Qualitatively Characterizing Neural Network Loss Landscapes"
|Trade-offs of Local SGD at Scale
|Are All Negatives Created Equal in Contrastive Instance Discrimination?
|Dissecting Pruned Neural Networks
During my year as Staff Technologist at the Center on Privacy and Technology at Georgetown Law, I studied police use of face recognition technology. I also collaborated with Prof. Paul Ohm on both scholarship and teaching. In the years since, I have served as an invited expert at the OECD, contributing to the OECD AI Principles and follow-up work.
|Computer Programming for Lawyers
|Course & Textbook
|Florida Law Review
|The Perpetual Lineup: Unregulated Police Face Recognition in America
|How Russia's New Facial Recognition App Could End Online Anonymity
|Facial-Recognition Software Might Have a Racial Bias Problem
During the early part of my PhD, I studied cryptography and technology policy. My research on using zero-knowledge arguments to facilitate novel tradeoffs between secrecy and accountability in the court system (Practical Accountability of Secret Processes) was published in USENIX Security 2018.
During my master's degree, I studied programming language theory with Prof. David Walker. My thesis work was on type-directed program synthesis, and I published an extended version of the thesis (Example-Directed Synthesis: A Type-Theoretic Interpretation) in POPL 2016.