Fast Neural Network Performance Estimator
Prior work has demonstrated that a good training speed estimator (TSE), combined with classic optimization techniques such as Bayesian Optimization or Evolutionary Algorithms, can help us find optimal network architectures.
Speedy Performance Estimation for Neural Architecture Search
The idea of the TSE estimator is the following:
- We train the network for only a small number of epochs and estimate its final performance from the training losses
- The sum runs from epoch $t=T-E+1$ to $t=T$, i.e. the first $T-E$ epochs are treated as a burn-in and only the last $E$ epochs contribute
- $TSE=\sum_{t=T-E+1}^{T}\Big[\frac{1}{B}\sum_{i=1}^{B}L\big(f_\theta(X_i),y_i\big)\Big]$
- There is, of course, also an exponential-moving-average version with a discount factor $\gamma$:
- $TSE_{EMA}=\sum_{t=1}^{T}\gamma^{T-t}\Big[\frac{1}{B}\sum_{i=1}^{B}L\big(f_\theta(X_i),y_i\big)\Big]$
Notice that in TSE the inner sum runs over all $B$ mini-batches, so every point in the dataset is traversed once per epoch.
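As a minimal sketch, the two estimators can be computed from per-epoch mean training losses like this (the training loop assumes PyTorch-style `model`, `loader`, `optimizer`, and `loss_fn` objects; all names here are illustrative, not the repo's API):

```python
def epoch_mean_loss(model, loader, optimizer, loss_fn):
    """One training epoch; returns (1/B) * sum_i L(f_theta(X_i), y_i)."""
    losses = []
    for X, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return sum(losses) / len(losses)

def tse(epoch_losses, E):
    """TSE: sum of the mean training losses over the last E epochs."""
    return sum(epoch_losses[-E:])

def tse_ema(epoch_losses, gamma):
    """TSE-EMA: sum over all T epochs, discounted by gamma^(T - t)."""
    T = len(epoch_losses)
    return sum(gamma ** (T - t) * loss for t, loss in enumerate(epoch_losses, 1))
```

In a search loop, each candidate is trained for a small budget of $T$ epochs, its `epoch_losses` are collected, and `tse(epoch_losses, E)` (or `tse_ema`) is used as the score to minimize instead of a fully trained validation accuracy.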
Bayesian Optimization for Hyperparameter Tuning
GitHub - rubinxin/TSE
Previous BO approaches use a GP to optimize both global and layer-wise hyperparameters in neural networks, but all of these hyperparameters are discrete points (e.g. the number of layers can only be in $\{1, 2, 3\}$).
Many hyperparameters are better tuned as a function than as a fixed value. Using the learning rate as an example, we are interested in the following:
- Instead of searching for the learning rate in a discrete set $\{0.1, 0.01, \dots\}$, we search over a configurable polynomial, where $t$ is the epoch index (see the sketch after this list)
$$
\mathrm{lr}(t) = (a_0t+b_0)(a_1t+b_1)(a_2t+b_2)
$$
- Instead of searching for one learning rate for the whole network, we search for one per layer
- Instead of searching and waiting for feedback from full training, we use the following estimators:
- TSE
- TSE-EMA
- Modified TSE
- $MTSE=\sum_{t=T-E+1}^{T}\Big[\frac{1}{B'}\sum_{i=1}^{B'}L\big(f_\theta(X'_i),y'_i\big)\Big]$
- where we sample only $B' \ll B$ mini-batches from the dataset (see the sketch after this list)
- MTSE-EMA
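To make the per-layer polynomial schedule concrete, here is a minimal sketch assuming a PyTorch model with one optimizer parameter group per layer; `poly_lr`, the coefficient values, and the layer split are illustrative choices, not the paper's implementation:

```python
import torch

def poly_lr(t, coeffs):
    """lr(t) = (a0*t + b0)(a1*t + b1)(a2*t + b2); clip so the lr stays positive."""
    lr = 1.0
    for a, b in coeffs:
        lr *= a * t + b
    return max(lr, 1e-8)

# One list of (a_i, b_i) pairs per layer; these numbers are placeholders that
# the BO loop would propose, not tuned values.
layer_coeffs = [
    [(-1e-3, 0.1), (0.0, 1.0), (0.0, 1.0)],   # layer 0: linear decay from 0.1
    [(-5e-4, 0.05), (0.0, 1.0), (0.0, 1.0)],  # layer 1: linear decay from 0.05
]

model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.Linear(16, 2))
# One parameter group per layer, so each layer can follow its own schedule.
optimizer = torch.optim.SGD(
    [{"params": layer.parameters(), "lr": 0.1} for layer in model]
)

for epoch in range(10):
    for group, coeffs in zip(optimizer.param_groups, layer_coeffs):
        group["lr"] = poly_lr(epoch, coeffs)
    # ... train for one epoch here, recording the mean mini-batch loss
    #     so the estimators above can score this configuration ...
```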
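And a sketch of the modification MTSE makes, again with illustrative names: instead of traversing all $B$ mini-batches per epoch, we train on a random subset of the data so each epoch only sees $B' \ll B$ batches, then feed the resulting per-epoch losses into the same `tse`/`tse_ema` functions as above:

```python
import torch
from torch.utils.data import DataLoader, Subset

def subsampled_loader(dataset, fraction, batch_size):
    """Loader over a random fraction of the data, yielding B' << B mini-batches."""
    n = max(1, int(len(dataset) * fraction))
    idx = torch.randperm(len(dataset))[:n].tolist()
    return DataLoader(Subset(dataset, idx), batch_size=batch_size, shuffle=True)

# MTSE / MTSE-EMA: run the usual training loop over this smaller loader and
# plug the per-epoch mean losses into tse(...) / tse_ema(...) from above.
```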