Fast Neural Network Performance Estimator
Prior work has demonstrated that a good training speed estimator (TSE), combined with classic optimization techniques such as Bayesian Optimization or Evolutionary Algorithms, can help us find optimal network architectures.
Speedy Performance Estimation for Neural Architecture Search
The idea of the TSE estimator is the following:
- We train the network for only a small number of epochs and estimate its final performance from the training losses
- The sum runs from epoch $t=T-E+1$ to $t=T$, i.e. the first $T-E$ epochs are treated as a burn-in and only the last $E$ epochs contribute
- $TSE=\sum_{t=T-E+1}^{T}\Big[\frac{1}{B}\sum_{i=1}^{B}L\big(f_\theta(X_i),y_i\big)\Big]$
- There is, of course, also an exponential-moving-average version with a discount factor $\gamma$:
- $TSE_{EMA}=\sum_{t=1}^{T}\gamma^{T-t}\Big[\frac{1}{B}\sum_{i=1}^{B}L\big(f_\theta(X_i),y_i\big)\Big]$
Notice that in TSE the inner sum runs over all $B$ mini-batches, so every point in the dataset is traversed once per epoch.
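As a minimal sketch, the two estimators can be computed from per-epoch mean training losses like this (the training loop assumes PyTorch-style `model`, `loader`, `optimizer`, and `loss_fn` objects; all names here are illustrative, not the repo's API):

```python
def epoch_mean_loss(model, loader, optimizer, loss_fn):
    """One training epoch; returns (1/B) * sum_i L(f_theta(X_i), y_i)."""
    losses = []
    for X, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return sum(losses) / len(losses)

def tse(epoch_losses, E):
    """TSE: sum of the mean training losses over the last E epochs."""
    return sum(epoch_losses[-E:])

def tse_ema(epoch_losses, gamma):
    """TSE-EMA: sum over all T epochs, discounted by gamma^(T - t)."""
    T = len(epoch_losses)
    return sum(gamma ** (T - t) * loss for t, loss in enumerate(epoch_losses, 1))
```

In a search loop, each candidate is trained for a small budget of $T$ epochs, its `epoch_losses` are collected, and `tse(epoch_losses, E)` (or `tse_ema`) is used as the score to minimize instead of a fully trained validation accuracy.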
Bayesian Optimization for Hyperparameter Tuning
GitHub - rubinxin/TSE
Previous BO approaches use a GP to optimize both global and layer-wise hyperparameters in neural networks, but all of these hyperparameters are discrete points (e.g. the number of layers can only be in $\{1, 2, 3\}$).
Many hyperparameters are better tuned as a function than as a fixed value. Using the learning rate as an example, we are interested in the following:
- Instead of searching for the learning rate in a discrete set $\{0.1, 0.01, \dots\}$, we search over a configurable polynomial, where $t$ is the epoch index (see the sketch after this list)
$$
\mathrm{lr}(t) = (a_0t+b_0)(a_1t+b_1)(a_2t+b_2)
$$
- Instead of searching for one learning rate for the whole network, we search for one per layer
- Instead of searching and waiting for feedback from full training, we use the following estimators:
- TSE
- TSE-EMA
- Modified TSE
- $MTSE=\sum_{t=T-E+1}^{T}\Big[\frac{1}{B'}\sum_{i=1}^{B'}L\big(f_\theta(X'_i),y'_i\big)\Big]$
- where we sample only $B' \ll B$ mini-batches from the dataset (see the sketch after this list)
- MTSE-EMA
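To make the per-layer polynomial schedule concrete, here is a minimal sketch assuming a PyTorch model with one optimizer parameter group per layer; `poly_lr`, the coefficient values, and the layer split are illustrative choices, not the paper's implementation:

```python
import torch

def poly_lr(t, coeffs):
    """lr(t) = (a0*t + b0)(a1*t + b1)(a2*t + b2); clip so the lr stays positive."""
    lr = 1.0
    for a, b in coeffs:
        lr *= a * t + b
    return max(lr, 1e-8)

# One list of (a_i, b_i) pairs per layer; these numbers are placeholders that
# the BO loop would propose, not tuned values.
layer_coeffs = [
    [(-1e-3, 0.1), (0.0, 1.0), (0.0, 1.0)],   # layer 0: linear decay from 0.1
    [(-5e-4, 0.05), (0.0, 1.0), (0.0, 1.0)],  # layer 1: linear decay from 0.05
]

model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.Linear(16, 2))
# One parameter group per layer, so each layer can follow its own schedule.
optimizer = torch.optim.SGD(
    [{"params": layer.parameters(), "lr": 0.1} for layer in model]
)

for epoch in range(10):
    for group, coeffs in zip(optimizer.param_groups, layer_coeffs):
        group["lr"] = poly_lr(epoch, coeffs)
    # ... train for one epoch here, recording the mean mini-batch loss
    #     so the estimators above can score this configuration ...
```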
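And a sketch of the modification MTSE makes, again with illustrative names: instead of traversing all $B$ mini-batches per epoch, we train on a random subset of the data so each epoch only sees $B' \ll B$ batches, then feed the resulting per-epoch losses into the same `tse`/`tse_ema` functions as above:

```python
import torch
from torch.utils.data import DataLoader, Subset

def subsampled_loader(dataset, fraction, batch_size):
    """Loader over a random fraction of the data, yielding B' << B mini-batches."""
    n = max(1, int(len(dataset) * fraction))
    idx = torch.randperm(len(dataset))[:n].tolist()
    return DataLoader(Subset(dataset, idx), batch_size=batch_size, shuffle=True)

# MTSE / MTSE-EMA: run the usual training loop over this smaller loader and
# plug the per-epoch mean losses into tse(...) / tse_ema(...) from above.
```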