# Common Optimization Algorithms (with the Corresponding Caffe and TensorFlow Parameters)

2016-12-06

## Common Algorithms

### SGD

`x+= -learning_rate*dx`
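
As a minimal sketch of the update above, the loop below runs plain gradient descent on a one-dimensional toy function. The objective `f(x) = (x - 3)**2`, the starting point, and the learning rate are illustrative assumptions, not part of the original notes.

```python
def grad(x):
    """Gradient of the assumed toy objective f(x) = (x - 3)**2."""
    return 2.0 * (x - 3.0)

learning_rate = 0.1   # assumed value for this toy problem
x = 0.0               # assumed starting point

for _ in range(50):
    dx = grad(x)
    x += -learning_rate * dx   # the SGD update rule

print(x)  # ends up close to the minimizer x = 3
```

Each step multiplies the distance to the minimizer by `1 - 2 * learning_rate`, so the iterate converges geometrically on this particular function.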

### Momentum

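Momentum keeps SGD from getting stuck oscillating around local saddle points, and it also speeds up convergence to some extent.
Early on, momentum may overshoot the target, but it usually corrects itself gradually.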

```
v = mu*v - learning_rate*dx   # velocity: decayed history plus the new gradient step
x += v                        # move along the velocity
```
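
To make the acceleration effect concrete, here is a small sketch that runs plain SGD and SGD with momentum side by side. The objective `f(x) = 0.5 * x**2`, the step count, and the hyperparameter values are assumptions chosen for illustration.

```python
def grad(x):
    """Gradient of the assumed toy objective f(x) = 0.5 * x**2."""
    return x

learning_rate = 0.01
mu = 0.9        # momentum coefficient
steps = 100

# Plain SGD
x_sgd = 1.0
for _ in range(steps):
    x_sgd += -learning_rate * grad(x_sgd)

# SGD with momentum
x_mom, v = 1.0, 0.0
for _ in range(steps):
    dx = grad(x_mom)
    v = mu * v - learning_rate * dx
    x_mom += v

# Distance to the minimum (x = 0) after the same number of steps
print(abs(x_sgd), abs(x_mom))
```

With these settings the momentum run ends much closer to the minimum than plain SGD after the same number of steps, although it oscillates around the minimum on the way there.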

### Nesterov momentum

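Nesterov momentum evaluates the gradient at the look-ahead point $\theta_{t-1} + \mu v_{t-1}$:

$$v_t = \mu v_{t-1} - \epsilon \nabla f(\theta_{t-1} + \mu v_{t-1})$$

$$\theta_t = \theta_{t-1} + v_t$$

Substituting $\phi_{t-1} = \theta_{t-1} + \mu v_{t-1}$ gives an update written only in terms of the stored variable $\phi$:

$$v_t = \mu v_{t-1} - \epsilon \nabla f(\phi_{t-1})$$

$$\phi_t = \phi_{t-1} - \mu v_{t-1} + (1 + \mu) v_t$$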

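```
v_prev = v                          # back up the previous velocity
v = mu*v - learning_rate*dx         # same velocity update as plain momentum
x += -mu*v_prev + (1+mu)*v          # step in the look-ahead variable
```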
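
The following sketch checks numerically that the look-ahead form and the rewritten form describe the same iterates, i.e. that the stored `x` always equals `theta + mu*v`. The objective `f(x) = 0.5 * x**2` and the hyperparameters are assumptions for illustration.

```python
def grad(x):
    """Gradient of the assumed toy objective f(x) = 0.5 * x**2."""
    return x

learning_rate, mu, steps = 0.01, 0.9, 50

# Form 1: keep theta and evaluate the gradient at the look-ahead point.
theta, v1 = 1.0, 0.0
# Form 2: store the look-ahead point x = theta + mu*v directly.
x, v2 = theta + mu * v1, 0.0

max_gap = 0.0
for _ in range(steps):
    # Form 1
    g = grad(theta + mu * v1)
    v1 = mu * v1 - learning_rate * g
    theta += v1

    # Form 2
    dx = grad(x)
    v_prev = v2
    v2 = mu * v2 - learning_rate * dx
    x += -mu * v_prev + (1 + mu) * v2

    # Largest observed difference between x and theta + mu*v
    max_gap = max(max_gap, abs(x - (theta + mu * v1)))

print(max_gap)
```

The second form is convenient in practice because the gradient is evaluated at the variable that is actually stored, so it fits the same interface as the other update rules.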

### Adagrad

```
cache += dx**2                                    # running sum of squared gradients
x += -learning_rate*dx/(np.sqrt(cache)+1e-7)      # per-parameter scaled step
```
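
The rule above divides each gradient by the root of its accumulated squared history (the Adagrad update), so parameters with very different gradient scales take similarly sized steps. Below is a minimal sketch in plain Python; the helper name `adagrad_step` and the two-parameter objective are hypothetical choices made for illustration.

```python
import math

def adagrad_step(x, cache, dx, learning_rate, eps=1e-7):
    """Apply one in-place Adagrad update to the parameter list x."""
    for i in range(len(x)):
        cache[i] += dx[i] ** 2
        x[i] += -learning_rate * dx[i] / (math.sqrt(cache[i]) + eps)

def grad(x):
    """Gradient of the assumed objective f(x) = 0.5 * (100 * x[0]**2 + x[1]**2)."""
    return [100.0 * x[0], x[1]]

x = [1.0, 1.0]
cache = [0.0, 0.0]

# First step: both coordinates move by about learning_rate,
# even though their gradients differ by a factor of 100.
adagrad_step(x, cache, grad(x), learning_rate=0.1)
first_x = list(x)
cache_after_first = list(cache)

# The cache only grows, so the effective step size can only shrink.
for _ in range(19):
    adagrad_step(x, cache, grad(x), learning_rate=0.1)

print(first_x, cache)
```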

### RMSProp

```
cache = decay_rate*cache + (1-decay_rate)*dx**2   # leaky average of squared gradients
x += -learning_rate*dx/(np.sqrt(cache)+1e-7)
```
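
To see what the decay buys, the sketch below feeds the same constant gradient into a cache that only accumulates (`cache += dx**2`) and into the leaky RMSProp cache, then compares the resulting step sizes. The constant gradient and the hyperparameter values are assumptions for illustration.

```python
import math

learning_rate = 0.01
decay_rate = 0.9
eps = 1e-7
dx = 1.0   # constant gradient, assumed for illustration

accumulating_cache = 0.0   # cache += dx**2
rmsprop_cache = 0.0        # leaky average of dx**2
for _ in range(1000):
    accumulating_cache += dx ** 2
    rmsprop_cache = decay_rate * rmsprop_cache + (1 - decay_rate) * dx ** 2

accumulating_step = learning_rate * dx / (math.sqrt(accumulating_cache) + eps)
rmsprop_step = learning_rate * dx / (math.sqrt(rmsprop_cache) + eps)

print(accumulating_step, rmsprop_step)
```

The accumulating cache keeps growing, so its step size keeps shrinking, while the RMSProp cache settles near `dx**2` and the step size stays close to `learning_rate`.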

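### Adam

Simplified form, without bias correction:

```
m = beta1*m + (1-beta1)*dx          # first moment: leaky average of gradients
v = beta2*v + (1-beta2)*(dx**2)     # second moment: leaky average of squared gradients
x += -learning_rate*m / (np.sqrt(v)+1e-7)
```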

Full form, with bias correction:

```
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
mb = m/(1-beta1**t)   # t is the step number, starting at 1
vb = v/(1-beta2**t)
x += -learning_rate*mb / (np.sqrt(vb)+1e-7)
```

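`mb` and `vb` act as a warm-up during the first steps, compensating for `m` and `v` being initialized at zero. Once `t` is large, `(1-beta1**t)` ≈ 1 and the correction no longer has any effect.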
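
A small sketch of the bias-corrected update as a function, followed by a check of how the correction factors behave at the first step and at a late step. The function name `adam_step`, the default hyperparameters, and the sample gradient are assumptions for illustration.

```python
import math

def adam_step(x, m, v, dx, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-7):
    """One bias-corrected Adam update; t is the 1-based step number."""
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * (dx ** 2)
    mb = m / (1 - beta1 ** t)
    vb = v / (1 - beta2 ** t)
    x = x - learning_rate * mb / (math.sqrt(vb) + eps)
    return x, m, v

# First step from zero-initialized moments: mb equals dx and vb equals dx**2,
# so the step has magnitude of about learning_rate whatever the gradient scale.
x0 = 5.0
x1, m1, v1 = adam_step(x0, 0.0, 0.0, dx=2.0, t=1)
first_step = x0 - x1

# Correction denominators early and late in training
early = (1 - 0.9 ** 1, 1 - 0.999 ** 1)
late = (1 - 0.9 ** 5000, 1 - 0.999 ** 5000)

print(first_step, early, late)
```

At `t = 1` the denominators are roughly 0.1 and 0.001, so the raw moments are scaled up considerably. By `t = 5000` both denominators are close to 1.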

### Second Order optimization methods

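Second-order Taylor expansion:

$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_\theta J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^T H (\theta - \theta_0)$$

Solving for the critical point of this approximation gives the Newton update:

$$\theta^* = \theta_0 - H^{-1} \nabla_\theta J(\theta_0)$$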
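
As a worked example of the Newton update, the sketch below applies one step to a two-dimensional quadratic, where the second-order expansion is exact and a single step lands on the minimizer. The matrix `A`, the vector `b_vec`, the starting point, and the helper names are all hypothetical values chosen for illustration.

```python
def newton_step(theta0, grad, hessian):
    """Return theta0 - H^{-1} grad for a 2-D problem, using Cramer's rule."""
    (a, b), (c, d) = hessian
    det = a * d - b * c
    # delta solves H * delta = grad
    delta0 = (d * grad[0] - b * grad[1]) / det
    delta1 = (-c * grad[0] + a * grad[1]) / det
    return [theta0[0] - delta0, theta0[1] - delta1]

# Assumed objective: J(theta) = 0.5 * theta^T A theta - b_vec^T theta,
# whose Hessian is A and whose gradient is A theta - b_vec.
A = [[3.0, 1.0], [1.0, 2.0]]
b_vec = [1.0, 1.0]

def gradient(theta):
    return [A[0][0] * theta[0] + A[0][1] * theta[1] - b_vec[0],
            A[1][0] * theta[0] + A[1][1] * theta[1] - b_vec[1]]

theta0 = [5.0, -7.0]
theta_star = newton_step(theta0, gradient(theta0), A)
residual = gradient(theta_star)   # gradient at the new point

print(theta_star, residual)
```

Note that the update has no learning rate. The cost is forming and inverting the Hessian, which is what makes the exact method impractical for large networks.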

* Quasi-Newton methods (e.g. BFGS) work with an approximate inverse Hessian matrix instead of computing and inverting the exact Hessian.
* L-BFGS (limited-memory BFGS) does not form or store the full inverse Hessian.
* L-BFGS usually works very well in full-batch, deterministic mode.

# TensorFlow Parameters for Each Optimization Algorithm

### Momentum

`optimizer = tf.train.MomentumOptimizer(learning_rate=lr, momentum=0.9)`

### RMSProp

`optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001, decay=0.9)`

`train_op = optimizer.minimize(loss)`

# Caffe Parameters for Each Optimization Algorithm

In Caffe, the optimizer is configured by setting the corresponding parameters in solver.prototxt.

### `type` selects the optimization algorithm

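* Stochastic Gradient Descent (`type: "SGD"`),
* AdaDelta (`type: "AdaDelta"`),
* Adaptive Gradient (`type: "AdaGrad"`),
* Adam (`type: "Adam"`),
* Nesterov's Accelerated Gradient (`type: "Nesterov"`), and
* RMSprop (`type: "RMSProp"`)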

### SGD

```
base_lr: 0.01
lr_policy: "step"    # exponential ("exp"), polynomial ("poly"), and other policies are also available
gamma: 0.1
stepsize: 1000
max_iter: 3500
momentum: 0.9
```
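
With `lr_policy: "step"`, the learning rate is `base_lr * gamma ^ floor(iter / stepsize)`. The sketch below evaluates that formula in Python for the values in the solver above. The function name `step_lr` is hypothetical, and this is only an illustration of the formula, not Caffe's own code.

```python
def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=1000):
    """Learning rate under the "step" policy at a given iteration."""
    return base_lr * gamma ** (iteration // stepsize)

# Learning rate at a few iterations within max_iter = 3500
schedule = {it: step_lr(it) for it in (0, 999, 1000, 2000, 3499)}

for it, lr in schedule.items():
    print(it, lr)
```

So over the 3500 iterations the rate is divided by 10 three times, at iterations 1000, 2000, and 3000.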

### AdaDelta

```
net: "examples/mnist/lenet_train_test.prototxt"
test_iter: 100
test_interval: 500
base_lr: 1.0
lr_policy: "fixed"
momentum: 0.95
weight_decay: 0.0005
display: 100
max_iter: 10000
snapshot: 5000
solver_mode: GPU
type: "AdaDelta"   # needed for delta to be used; the default type is "SGD"
delta: 1e-6
```

### AdaGrad

```
net: "examples/mnist/mnist_autoencoder.prototxt"
test_state: { stage: 'test-on-train' }
test_iter: 500
test_state: { stage: 'test-on-test' }
test_iter: 100
test_interval: 500
test_compute_loss: true
base_lr: 0.01
lr_policy: "fixed"
display: 100
max_iter: 65000
weight_decay: 0.0005
snapshot: 10000
# solver mode: CPU or GPU
solver_mode: GPU
type: "AdaGrad"    # the default type is "SGD"
```

### Nesterov

```
base_lr: 0.01
lr_policy: "step"    # the "step" policy also needs a stepsize, as in the SGD example
gamma: 0.1
weight_decay: 0.0005
momentum: 0.95
type: "Nesterov"
```

### Adam

```
train_net: "nin_train_val.prototxt"
base_lr: 0.001
###############
##### step: base_lr * gamma ^ (floor(iter / stepsize))
#lr_policy: "step"
#gamma: 0.1
#stepsize: 25000
##### multi-step:
#lr_policy: "multistep"
#gamma: 0.5
#stepvalue: 1000
#stepvalue: 2000
#stepvalue: 3000
#stepvalue: 4000
#stepvalue: 5000
#stepvalue: 10000
#stepvalue: 20000
##### inv: base_lr * (1 + gamma * iter) ^ (- power)
# lr_policy: "inv"
# gamma: 0.0001
# power: 2
##### exp: base_lr * gamma ^ iter
# lr_policy: "exp"
# gamma: 0.9
##### poly: base_lr * (1 - iter/max_iter) ^ (power)
# lr_policy: "poly"
# power: 0.9
##### sigmoid: base_lr * (1 / (1 + exp(-gamma * (iter - stepsize))))
# lr_policy: "sigmoid"
# gamma: 0.9
#momentum: 0.9
type: "Adam"         # assumed: momentum2 and delta below are Adam parameters
momentum: 0.9
momentum2: 0.999
delta: 1e-8
lr_policy: "fixed"

display: 100
max_iter: 50000
weight_decay: 0.0005
snapshot: 5000
snapshot_prefix: "./stage1/sgd_DeepBit1024_alex_stage1"
solver_mode: GPU
```
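
The commented-out policies in the solver above each define the learning rate as a function of the iteration. The sketch below transcribes the `inv`, `exp`, `poly`, `sigmoid`, and `multistep` formulas into Python so they can be compared directly. The function names are hypothetical, the `multistep` rule assumes the `stepvalue` entries are sorted in increasing order, and none of this is Caffe's own code.

```python
import math

def inv_lr(it, base_lr, gamma, power):
    """inv: base_lr * (1 + gamma * iter) ^ (-power)"""
    return base_lr * (1 + gamma * it) ** (-power)

def exp_lr(it, base_lr, gamma):
    """exp: base_lr * gamma ^ iter"""
    return base_lr * gamma ** it

def poly_lr(it, base_lr, max_iter, power):
    """poly: base_lr * (1 - iter / max_iter) ^ power"""
    return base_lr * (1 - it / max_iter) ** power

def sigmoid_lr(it, base_lr, gamma, stepsize):
    """sigmoid: base_lr * 1 / (1 + exp(-gamma * (iter - stepsize)))"""
    return base_lr * (1 / (1 + math.exp(-gamma * (it - stepsize))))

def multistep_lr(it, base_lr, gamma, stepvalues):
    """multistep: multiply by gamma once for every stepvalue already reached."""
    passed = sum(1 for s in stepvalues if it >= s)
    return base_lr * gamma ** passed

base_lr = 0.001
print(inv_lr(10000, base_lr, gamma=0.0001, power=2))
print(exp_lr(2, base_lr, gamma=0.9))
print(poly_lr(25000, base_lr, max_iter=50000, power=0.9))
print(sigmoid_lr(25000, base_lr, gamma=0.9, stepsize=25000))
print(multistep_lr(2500, base_lr, gamma=0.5, stepvalues=[1000, 2000, 3000]))
```

With `lr_policy: "fixed"`, as in the active part of the solver, none of these apply and the rate stays at `base_lr`.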

### RMSProp

```
net: "examples/mnist/lenet_train_test.prototxt"
test_iter: 100
test_interval: 500
base_lr: 1.0
lr_policy: "fixed"
momentum: 0.0        # Caffe's RMSProp solver does not accept a non-zero momentum
weight_decay: 0.0005
display: 100
max_iter: 10000
snapshot: 5000
solver_mode: GPU
type: "RMSProp"
rms_decay: 0.98
```