view weights.py @ 507:b8e6de17eaa6

modifs to smallNorb
author James Bergstra <bergstrj@iro.umontreal.ca>
date Wed, 29 Oct 2008 18:06:49 -0400
parents 4f3c66146f17
children

"""
Routine to initialize weights.

@note: We assume that numpy.random.seed() has already been called.
"""

from math import pow, sqrt
import numpy.random

sqrt3 = sqrt(3.0)  # Uniform(-sqrt3, +sqrt3) has unit variance

def random_weights(nin, nout, scale_by=1./sqrt3, power=0.5):
    """
    Generate an initial weight matrix with nin inputs (rows) and nout
    outputs (cols).
    Each weight is chosen uniformly at random to be in range:
        [-scale_by*sqrt(3)/pow(nin,power), +scale_by*sqrt(3)/pow(nin,power)]
    @note: Play with scale_by, but reasonable values are <=1, maybe 1./sqrt3.
    power=0.5 is strongly recommended (see below).

    Suppose these weights w are used in dot products as follows:
       output = w' input
    If w_i ~ Uniform(-r,r), the input_i's have zero mean and Var[input_i]=1,
    and the w_i's and input_i's are all independent, then
       Var[w_i] = r^2/3
       Var[output] = Var[ sum_{i=1}^d w_i input_i ] = d r^2 / 3
    where d = nin.  To make sure that variance is not changed after the dot
    product, we therefore want Var[output]=1 and r = sqrt(3)/sqrt(d).  This
    choice corresponds to scale_by=1 and power=0.5 (the default
    scale_by=1/sqrt(3) is deliberately smaller, as explained in the next
    paragraph).
    More generally, with power=0.5, Var[output] = Var[input] * scale_by^2.

    Now, if these are weights in a deep multi-layer neural network,
    we would like the top layers to be initially more linear, so as to let
    gradients flow back more easily (this is an explanation by Ronan Collobert).
    To achieve this we want scale_by smaller than 1.
    Ronan used scale_by=1/sqrt(3) (by mistake!) and got better results than scale_by=1
    in the experiment of his ICML'2008 paper.
    Note that if we have a multi-layer network, ignoring the effect of the tanh non-linearity,
    the variance of the layer outputs would go down roughly by a factor scale_by^2 at each
    layer, i.e. 1/3 for scale_by=1/sqrt(3) (making the layers more linear as we go up
    towards the output).
    """
    return (numpy.random.rand(nin, nout) * 2.0 - 1) * scale_by * sqrt3 / pow(nin,power)
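
The variance argument in the docstring can be checked numerically. The following is a minimal sketch, not part of the original module: it repeats the random_weights formula so that it runs on its own, and the helpers output_variance and linear_stack_variances, the layer sizes, and the sample counts are hypothetical choices made only for illustration. It assumes zero-mean, unit-variance Gaussian inputs and purely linear layers (no tanh).

```python
import numpy as np


def random_weights(nin, nout, scale_by=1.0 / np.sqrt(3.0), power=0.5):
    # Same formula as random_weights in weights.py, repeated so this sketch is self-contained.
    return (np.random.rand(nin, nout) * 2.0 - 1) * scale_by * np.sqrt(3.0) / nin ** power


def output_variance(nin, nout, scale_by, n_samples=2000):
    # Hypothetical helper: empirical variance of input.dot(w) for
    # zero-mean, unit-variance Gaussian inputs.
    x = np.random.randn(n_samples, nin)
    w = random_weights(nin, nout, scale_by=scale_by)
    return np.dot(x, w).var()


def linear_stack_variances(width, depth, scale_by, n_samples=2000):
    # Hypothetical helper: variance after each layer of a purely linear
    # stack (the tanh non-linearity is ignored, as in the docstring).
    h = np.random.randn(n_samples, width)
    variances = []
    for _ in range(depth):
        h = np.dot(h, random_weights(width, width, scale_by=scale_by))
        variances.append(h.var())
    return variances


np.random.seed(0)
var_unit = output_variance(500, 200, scale_by=1.0)
var_default = output_variance(500, 200, scale_by=1.0 / np.sqrt(3.0))
layer_vars = linear_stack_variances(300, 4, scale_by=1.0 / np.sqrt(3.0))

print("scale_by=1         -> Var[output] ~", var_unit)
print("scale_by=1/sqrt(3) -> Var[output] ~", var_default)
print("per-layer variances:", layer_vars)
```

Under these assumptions, the first value should come out close to 1, the second close to 1/3, and each entry of layer_vars should be roughly one third of the previous one, which matches Var[output] = Var[input] * scale_by^2 with power=0.5.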