weights.py @ 490:4f3c66146f17

Moved weights.py out of sandbox

author | Joseph Turian <turian@gmail.com>
---|---
date | Tue, 28 Oct 2008 10:54:26 -0400
parents | sandbox/weights.py@3daabc7f94ff
children |
1 """ | |
2 Routine to initialize weights. | |
3 | |
4 @note: We assume that numpy.random.seed() has already been performed. | |
5 """ | |
6 | |
7 from math import pow, sqrt | |
8 import numpy.random | |
9 | |
10 sqrt3 = sqrt(3.0) | |
11 def random_weights(nin, nout, scale_by=1./sqrt3, power=0.5): | |
12 """ | |
13 Generate an initial weight matrix with nin inputs (rows) and nout | |
14 outputs (cols). | |
15 Each weight is chosen uniformly at random to be in range: | |
16 [-scale_by*sqrt(3)/pow(nin,power), +scale_by*sqrt(3)/pow(nin,power)] | |
17 @note: Play with scale_by, but reasonable values are <=1, maybe 1./sqrt3 | |
18 power=0.5 is strongly recommanded (see below). | |
19 | |
20 Suppose these weights w are used in dot products as follows: | |
21 output = w' input | |
22 If w ~ Uniform(-r,r) and Var[input_i]=1 and x_i's are independent, then | |
23 Var[w]=r2/3 | |
24 Var[output] = Var[ sum_{i=1}^d w_i input_i] = d r2 / 3 | |
25 To make sure that variance is not changed after the dot product, | |
26 we therefore want Var[output]=1 and r = sqrt(3)/sqrt(d). This choice | |
27 corresponds to the default values scale_by=sqrt(3) and power=0.5. | |
28 More generally we see that Var[output] = Var[input] * scale_by. | |
29 | |
30 Now, if these are weights in a deep multi-layer neural network, | |
31 we would like the top layers to be initially more linear, so as to let | |
32 gradients flow back more easily (this is an explanation by Ronan Collobert). | |
33 To achieve this we want scale_by smaller than 1. | |
34 Ronan used scale_by=1/sqrt(3) (by mistake!) and got better results than scale_by=1 | |
35 in the experiment of his ICML'2008 paper. | |
36 Note that if we have a multi-layer network, ignoring the effect of the tanh non-linearity, | |
37 the variance of the layer outputs would go down roughly by a factor 'scale_by' at each | |
38 layer (making the layers more linear as we go up towards the output). | |
39 """ | |
40 return (numpy.random.rand(nin, nout) * 2.0 - 1) * scale_by * sqrt3 / pow(nin,power) |
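
The quick check below is not part of the original module; it is a minimal sketch that assumes `weights.py` is importable from the working directory. It empirically verifies the two claims in the docstring: that with power=0.5 the dot product gives Var[output] ≈ scale_by^2 * Var[input], and that stacking layers (ignoring any non-linearity) shrinks the variance by a factor of scale_by^2 per layer.

```python
from math import sqrt

import numpy
import numpy.random

from weights import random_weights  # assumes weights.py is on the path

numpy.random.seed(0)  # the module expects the caller to seed the RNG

nin = nout = 500
scale_by = 1. / sqrt(3.0)  # the module's default

# Inputs with Var[input_i] = 1; each row is one input vector.
x = numpy.random.randn(2000, nin)

# Single dot product: variance should be multiplied by scale_by**2.
w = random_weights(nin, nout, scale_by=scale_by, power=0.5)
output = numpy.dot(x, w)
print(output.var())    # close to scale_by**2 = 1/3
print(scale_by ** 2)

# Stacked layers (purely linear, i.e. ignoring the tanh non-linearity):
# the variance shrinks by a factor of scale_by**2 at each layer.
h = x
for layer in range(3):
    w = random_weights(h.shape[1], nout, scale_by=scale_by, power=0.5)
    h = numpy.dot(h, w)
    print(layer, h.var())   # roughly (1/3)**1, (1/3)**2, (1/3)**3
```

With the variance-preserving choice scale_by=1, the printed variances stay near 1 at every layer instead of decaying.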